nemotron-page-elements-v3

Description

nemotron-page-elements-v3 is a specialized object detection model designed to identify and extract key page elements in documents, including tables, charts, infographics, titles, headers/footers, and text regions. It supports document analysis and multimodal extraction workflows used in enterprise document understanding and retrieval.

This model is ready for commercial/non-commercial use.

License and Terms of Use:

GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement.

You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.

Model Developer: NVIDIA

Deployment Geography:

Global

Use Case:

This model is designed for automating extraction of page elements in enterprise documents, including:

Enterprise document extraction, embedding, and indexing
Augmenting Retrieval Augmented Generation (RAG) workflows with multimodal retrieval
Data extraction from legacy documents and reports

This model supersedes the nemoretriever-page-elements-v2 model.

Release Date:

Build.NVIDIA.com 03/02/2026 via nemotron-page-elements-v3

Reference(s):

Model Architecture:

Architecture Type: YOLOX
Network Architecture: DarkNet53 Backbone + FPN decoupled head (one 1x1 convolution + 2 parallel 3x3 convolutions: one for classification and one for bounding box prediction)
Number of Model Parameters: ~5.4e7 (about 54 million)
Input Resize: (1024, 1024)

Input:

Input Types: Image
Input Formats: RGB
Input Parameters: Two Dimensional (2D)
Other Input Properties: Image is resized to (1024, 1024).
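
If detections are expressed in the resized 1024 x 1024 frame, they need to be mapped back to the original page dimensions before cropping or overlaying. The following is a minimal sketch of that mapping, not the model's actual post-processing. It assumes a direct stretch to (1024, 1024) with no aspect-ratio-preserving padding, and boxes given as pixel coordinates (x1, y1, x2, y2); the card states neither detail. The function name `to_original_coords` is hypothetical.

```python
INPUT_SIZE = (1024, 1024)  # (width, height) the image is resized to, per the card


def to_original_coords(box, orig_width, orig_height, input_size=INPUT_SIZE):
    """Map an (x1, y1, x2, y2) box from the resized frame to the original image.

    Assumption: the image was stretched directly to input_size, with no
    letterbox padding. If padding is used, the offsets must be removed first.
    """
    scale_x = orig_width / input_size[0]
    scale_y = orig_height / input_size[1]
    x1, y1, x2, y2 = box
    return (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)


if __name__ == "__main__":
    # A box in the 1024 x 1024 frame mapped onto a 2048 x 3072 page image.
    print(to_original_coords((512, 256, 1024, 1024), 2048, 3072))
```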

Output:

Output Types: Structured detections (bounding boxes + labels + confidence)
Output Format: JSON-compatible structure
Output Parameters: One Dimensional (1D)
Other Output Properties: Outputs bounding boxes, confidence scores, and object classes (chart, table, infographic, title, text, header/footer). Thresholds used for non-maximum suppression: conf_thresh = 0.01; iou_thresh = 0.5.
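
To illustrate how the two thresholds interact, here is a minimal pure-Python sketch of greedy non-maximum suppression using conf_thresh = 0.01 and iou_thresh = 0.5 from the line above. It is not the model's actual implementation. The detection layout (dicts with `box`, `score`, and `label` keys), the per-class suppression, and the inclusive comparison against the confidence threshold are all assumptions made for the example.

```python
CONF_THRESH = 0.01  # confidence threshold stated in the card
IOU_THRESH = 0.5    # IoU threshold stated in the card


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thresh=CONF_THRESH, iou_thresh=IOU_THRESH):
    """Greedy NMS over detections shaped like
    {"box": (x1, y1, x2, y2), "score": float, "label": str} (hypothetical layout).

    Assumption: suppression is applied per class, so boxes with different
    labels never suppress each other.
    """
    # Drop low-confidence boxes, then visit the rest from highest score down.
    candidates = sorted(
        (d for d in detections if d["score"] >= conf_thresh),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        overlaps_kept = any(
            det["label"] == k["label"] and iou(det["box"], k["box"]) > iou_thresh
            for k in kept
        )
        if not overlaps_kept:
            kept.append(det)
    return kept
```

A conf_thresh as low as 0.01 keeps nearly every candidate box, which favors recall; downstream consumers may want to apply a stricter score cutoff of their own.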

Output Classes:

Table: Data structured in rows and columns
Chart: Specifically bar charts, line charts, or pie charts
Infographic: Visual representations of information more complex than a chart (diagrams, flowcharts); maps are not considered infographics
Title: Section titles, or table/chart/infographic titles
Header/footer: Page headers and footers
Text: Regions of one or more text paragraphs, or standalone text not belonging to any of the classes above

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engines: TensorRT
Supported Hardware Microarchitecture Compatibility:
NVIDIA Ampere
NVIDIA Hopper
NVIDIA Lovelace
Operating Systems: Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

nemotron-page-elements-v3

Short Name: nemotron-page-elements-v3

Training and Evaluation Datasets:

Training Dataset

Data Modality: Image
Image Training Data Size: Less than a Million Images
Training Data Collection: Automated
Training Labeling: Hybrid (Automated, Human)
Training Properties: Pretrained on 118,287 images from COCO train2017 and fine-tuned on 36,093 images from the Digital Corpora dataset, with annotations from Azure AI Document Intelligence and a data annotation team. Bounding boxes per class: 35,328 tables, 44,178 titles, 11,313 charts, 6,500 infographics, 90,812 text regions, and 10,743 headers/footers. The Document Intelligence layout model was used with the 2024-02-29-preview API version.

Evaluation Dataset

Evaluation Data Collection: Hybrid (Automated, Human)
Evaluation Labeling: Hybrid (Automated, Human)
Evaluation Properties: The primary evaluation set is a subset of the Digital Corpora images and their Azure labels. Bounding boxes per class: 1,985 tables, 2,922 titles, 498 charts, 572 infographics, 4,400 text regions, and 492 headers/footers. Mean Average Precision (mAP) was used as the evaluation metric. We evaluated against Azure labels on manually selected pages, and also manually inspected results on public PDFs and PowerPoint slides.

Per-class Performance Metrics:

Class	AP (%)	AR (%)
table	44.643	62.242
chart	54.191	77.557
title	38.529	56.315
infographic	66.863	69.306
text	45.418	73.017
header_footer	53.895	75.670
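
The card reports per-class figures only, with no aggregate. As a rough sketch, an overall number can be derived as the unweighted mean over the six classes. Unweighted averaging is an assumption about how mAP is aggregated here, and the function name `mean_metrics` is hypothetical.

```python
# Per-class (AP %, AR %) values copied from the table above.
PER_CLASS = {
    "table": (44.643, 62.242),
    "chart": (54.191, 77.557),
    "title": (38.529, 56.315),
    "infographic": (66.863, 69.306),
    "text": (45.418, 73.017),
    "header_footer": (53.895, 75.670),
}


def mean_metrics(per_class):
    """Return (mean AP, mean AR) as unweighted averages over classes."""
    n = len(per_class)
    mean_ap = sum(ap for ap, _ in per_class.values()) / n
    mean_ar = sum(ar for _, ar in per_class.values()) / n
    return mean_ap, mean_ar


if __name__ == "__main__":
    mean_ap, mean_ar = mean_metrics(PER_CLASS)
    print(f"mean AP: {mean_ap:.3f}%  mean AR: {mean_ar:.3f}%")
```

With the values above this works out to roughly 50.6% AP and 69.0% AR. These are derived here from the table and are not figures reported by the card.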

Inference

Acceleration Engine: TensorRT
Test Hardware: NVIDIA Hopper (H100 PCIe/SXM)

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case, and address unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.