nemoretriever-page-elements-v2

Model for object detection, fine-tuned to detect charts, tables, infographics, and titles in documents.

Model Overview

Description

The NeMo Retriever Page Elements v2 model is a specialized object detection model designed to identify and extract key elements from document pages. While the underlying technology builds upon work from Megvii Technology, we developed our own base model through complete retraining rather than using pre-trained weights. The model is based on YOLOX, an anchor-free version of YOLO (You Only Look Once) that combines a simpler architecture with enhanced performance. The model is trained to detect tables, charts, infographics, and titles in documents.

This model supersedes the nv-yolox-page-elements model.

This model is ready for commercial use and is a part of the NVIDIA NeMo Retriever family of NIM microservices specifically for object detection and multimodal extraction of enterprise documents.

License/Terms of use

The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

Model Architecture

Architecture Type: YOLOX
Network Architecture: DarkNet53 backbone + FPN with a decoupled head: one 1x1 convolution followed by two parallel 3x3 convolutions, one for classification and one for bounding box prediction (a sketch follows below). YOLOX is a single-stage object detector that improves on YOLOv3.
Deployment Geography: Global
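
For illustration, the decoupled head described above is small enough to sketch. The following is a hypothetical PyTorch rendering of that structure, not the model's actual implementation; the channel width and per-branch output layout are assumptions:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Hypothetical sketch of the decoupled head described above: a 1x1 stem
    followed by two parallel 3x3 branches, one for class scores and one for
    box regression. Channel widths are illustrative assumptions."""

    def __init__(self, in_channels: int, num_classes: int = 4, width: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, kernel_size=1)        # 1x1 reduction
        self.cls_branch = nn.Conv2d(width, num_classes, 3, padding=1)   # class scores
        self.reg_branch = nn.Conv2d(width, 4, 3, padding=1)             # box (x, y, w, h)

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)

# One head per FPN level; a single feature map stands in for one level here.
feats = torch.randn(1, 512, 32, 32)
cls_out, reg_out = DecoupledHead(512)(feats)
print(cls_out.shape, reg_out.shape)  # (1, 4, 32, 32), (1, 4, 32, 32)
```

Decoupling classification from box regression is the main departure from the coupled head of YOLOv3 and is one source of YOLOX's accuracy gains.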

Use Case:
This model is designed for automating extraction of charts, tables, infographics, and titles in enterprise documents. Key applications include:

  • Enterprise document extraction, embedding, and indexing
  • Augmenting Retrieval Augmented Generation (RAG) workflows with multimodal retrieval
  • Data extraction from legacy documents and reports

Release Date: 2025-03-17

Intended use

The NeMo Retriever Page Elements v2 model is suitable for users who want to extract, and ultimately retrieve, tables, charts, infographics, and titles. It can be used for document analysis, understanding, and processing.

Technical Details

Input

Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: Two Dimensional (2D)
Other Properties Related to Input: Image size resized to (1024, 1024)
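
Client-side preparation of a page image might look like the following minimal sketch, assuming a plain bilinear resize to 1024x1024; whether the service additionally normalizes or letterbox-pads is not specified here:

```python
import numpy as np
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    """Load an image, force 3-channel RGB, and resize to the model's 1024x1024 input."""
    img = Image.open(path).convert("RGB")            # enforce RGB
    img = img.resize((1024, 1024), Image.BILINEAR)   # target input size
    arr = np.asarray(img, dtype=np.float32)          # HWC, values in [0, 255]
    return arr.transpose(2, 0, 1)[None]              # 1 x 3 x 1024 x 1024 (NCHW)
```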

Output

Output Type(s): Array
Output Format: A dictionary of dictionaries containing np.ndarray values. The outer dictionary is keyed by sample (page); each inner entry holds a list of dictionaries giving the bounding box, class, and confidence of every detection on that page
Output Parameters: 1D
Other Properties Related to Output: Output contains the bounding box, detection confidence, and object class (chart, table, infographic, or title). Thresholds used for non-maximum suppression: conf_thresh = 0.01; iou_thresh = 0.5. A filtering sketch follows the class list below.
Output Classes:

  • Table
    • Data structured in rows and columns
  • Chart
    • Specifically bar charts, line charts, or pie charts
  • Infographic
    • Visual representations of information that are more complex than a chart, including diagrams and flowcharts
    • Maps are not considered infographics
  • Title
    • Titles can be page titles, section titles, or table/chart/infographic titles
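
Given the output shape described above, downstream filtering might look like the sketch below. The field names (bbox, class, confidence), coordinate convention, and page keys are assumptions about the concrete schema; because the service's conf_thresh of 0.01 is deliberately permissive, consumers typically re-filter at a higher confidence floor:

```python
import numpy as np

# Hypothetical response matching the shape described above: an outer dict keyed
# by page, each holding a list of per-detection dicts. Field names are assumptions.
detections = {
    "page_0": [
        {"bbox": np.array([0.10, 0.05, 0.70, 0.09]), "class": "title", "confidence": 0.88},
        {"bbox": np.array([0.12, 0.30, 0.55, 0.62]), "class": "table", "confidence": 0.91},
        {"bbox": np.array([0.58, 0.30, 0.95, 0.62]), "class": "chart", "confidence": 0.12},
    ],
}

def filter_detections(pages: dict, min_conf: float = 0.5) -> dict:
    """Keep only detections at or above a caller-chosen confidence floor."""
    return {
        page: [d for d in dets if d["confidence"] >= min_conf]
        for page, dets in pages.items()
    }

for page, dets in filter_detections(detections, min_conf=0.5).items():
    for d in dets:
        print(page, d["class"], d["confidence"], d["bbox"])
```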

Software Integration

Runtime: NeMo Retriever Page Elements v2 NIM
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Ada Lovelace
Supported Operating System(s): Linux
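
A request against a locally deployed NIM might look roughly like the following; the host, route, and payload schema are assumptions for illustration only, so consult the NIM documentation for the actual API:

```python
import base64
import requests

# Hypothetical request shape: the URL, route, and payload schema below are
# assumptions -- check the NIM documentation for the real API of your deployment.
NIM_URL = "http://localhost:8000/v1/infer"

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {"input": [{"type": "image_url", "url": f"data:image/png;base64,{image_b64}"}]}
response = requests.post(NIM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())  # expected to contain per-page detections as described above
```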

Model Version(s):

  • nemoretriever-page-elements-v2

Training Dataset & Evaluation

Training Dataset

Data collection method by dataset: Automated
Labeling method by dataset: Hybrid: Automated, Human
Pretraining (by NVIDIA): 118,287 images from the COCO train2017 dataset
Finetuning (by NVIDIA): 36,093 images from the Digital Corpora dataset, with annotations from Azure AI Document Intelligence and a data annotation team
Number of bounding boxes per class: 35,328 tables, 44,178 titles, 11,313 charts, and 6,500 infographics. The layout model of Azure AI Document Intelligence was used, with the 2024-02-29-preview API version.

Evaluation Results

The primary evaluation set is a cut of the Azure-labeled Digital Corpora images. Number of bounding boxes per class: 1,483 tables, 1,965 titles, 404 charts, and 500 infographics. Mean Average Precision (mAP) was used as the evaluation metric; it measures the model's ability to correctly identify and localize objects across different confidence thresholds.

Data collection method by dataset: Hybrid: Automated, Human
Labeling method by dataset: Hybrid: Automated, Human
Properties: We evaluated with Azure labels from manually selected pages, as well as manual inspection of public PDFs and PowerPoint slides.

Per-class Performance Metrics:

Class         AP (%)    AR (%)
table         45.619    69.814
chart         53.419    75.755
title         45.116    65.245
infographic   96.591    97.400
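
For intuition about these metrics: AP for a class is obtained by matching detections to ground-truth boxes (commonly at an IoU threshold such as 0.5), sweeping the confidence threshold to trace a precision-recall curve, and integrating it; mAP is the mean over classes (averaging the four per-class AP values above gives roughly 60.2%). The sketch below shows the matching step under an assumed 0.5 matching threshold:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def tp_fp_counts(dets, gts, match_thresh=0.5):
    """Greedy matching in confidence order; each ground-truth box is claimed once.
    Returns (true_positives, false_positives) for one class on one page."""
    claimed = set()
    tp = fp = 0
    for det in sorted(dets, key=lambda d: d["confidence"], reverse=True):
        best, best_iou = None, match_thresh
        for i, gt in enumerate(gts):
            overlap = iou(det["bbox"], gt)
            if i not in claimed and overlap >= best_iou:
                best, best_iou = i, overlap
        if best is None:
            fp += 1
        else:
            claimed.add(best)
            tp += 1
    return tp, fp

dets = [{"bbox": (0, 0, 10, 10), "confidence": 0.9},
        {"bbox": (50, 50, 60, 60), "confidence": 0.8}]
gts = [(1, 1, 11, 11)]
print(tp_fp_counts(dets, gts))  # (1, 1): one match, one false positive
```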

Inference:

Engine: TensorRT
Test hardware: See the Support Matrix in the NIM documentation

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.