## **Model Overview** ### **Description** The **NeMo Retriever Graphic Elements v1** model is a specialized object detection system designed to identify and extract key elements from charts and graphs. Based on YOLOX, an anchor-free version of YOLO (You Only Look Once), this model combines a simpler architecture with enhanced performance. While the underlying technology builds upon work from [Megvii Technology](https://github.com/Megvii-BaseDetection/YOLOX), we developed our own base model through complete retraining rather than using pre-trained weights. The model excels at detecting and localizing various graphic elements within chart images, including titles, axis labels, legends, and data point annotations. This capability makes it particularly valuable for document understanding tasks and automated data extraction from visual content. This model is ready for commercial use and is a part of the NVIDIA NeMo Retriever family of NIM microservices specifically for object detection and multimodal extraction of enterprise documents. This model supersedes the [CACHED](https://build.nvidia.com/university-at-buffalo/cached) model. ### **License/Terms of use** Use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/). **You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.** **Deployment Geography**: Global **Use Case**:
This model is designed for automating extraction of graphic elements of charts in enterprise documents. Key applications include: - Enterprise document extraction, embedding and indexing - Augmenting Retrieval Augmented Generation (RAG) workflows with multimodal retrieval - Data extraction from legacy documents and reports **Release Date**: 2025-03-17 ### **Model Architecture** **Architecture type:** YOLOX
**Network architecture:** DarkNet53 Backbone \+ FPN Decoupled head (one 1x1 convolution \+ 2 parallel 3x3 convolutions (one for the classification and one for the bounding box prediction) YOLOX is a single-stage object detector that improves on Yolo-v3. The model is fine-tuned to detect 10 classes of objects in documents: 1. Chart title 1. X-axis title 1. Y-axis title 1. X-axis label(s) 1. Y-axis label(s) 1. Legend label(s) 1. Legend title 1. Markings and values labels 1. Miscellaneous other texts on the chart ## **Input** **Input type(s):** Image
**Input format(s):** Red, Green, Blue (RGB)
**Input parameters:** Two Dimensional (2D)
**Other properties related to input:** Expected input is a `np.ndarray` image of shape `[Channel, Width, Height]`, or an `np.ndarray` batch of image of shape `[Batch, Channel, Width, Height]`. ## **Output** **Output type(s):** Text associated to each of the following classes :
* `["chart_title", "x_title", "y_title", "xlabel", "ylabel", "other", "legend_label", "legend_title", "mark_label", "value_label"]` **Output format:** Dict of String
**Output parameters:** 1D
**Other properties related to output:** None ### Software Integration **Runtime Engine**: **NeMo Retriever Graphic Elements v1** NIM
**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace
**Supported Operating System(s)**: Linux ## **Model Version(s):** * `nemoretriever-graphic-elements-v1` ## **Training Dataset:** * PubMed Central (PMC) Chart Dataset * **Link:** [https://chartinfo.github.io/index\_2022.html](https://chartinfo.github.io/index_2022.html) * **Data collection method:** Automated, Human * **Labeling method**: Human * **Description:** A real-world dataset collected from PubMed Central Documents and manually annotated, released in the ICPR 2022 CHART-Infographic competition. There are 5,614 images for chart element detection, 4,293 images for final plot detection and data extraction, and 22,924 images for chart classification. * DeepRule dataset * **Link:** [https://github.com/soap117/DeepRule](https://github.com/soap117/DeepRule) * **Data collection method:** Automated, Human * **Labeling method**: Distillation by the CACHED model * **Description:** The original dataset consists of 386,966 chart images obtained by crawling public Excel sheets from the web with texts overwritten to protect privacy. The CACHED model is used to pseudo-label the relevant classes. We used a subsample of 9,091 charts where a title was detected for training alongside with the 5,614 PMC training images. ## **Evaluation Results** Results were evaluated using the **PMC Chart dataset**. The **Mean Average Precision (mAP)** was used as the evaluation metric to measure the model's ability to correctly identify and localize objects across different confidence thresholds. ### **Data Collection & Labeling** - **Data collection method:** **Hybrid (Automated & Human)** - **Labeling method:** **Hybrid (Automated & Human)** - **Properties:** The validation dataset is the same as the **PMC Chart dataset**. ### **Dataset Overview** **Number of bounding boxes and images per class:** | **Label** | **Images** | **Boxes** | |----------------|----------:|---------:| | **chart_title** | 38 | 38 | | **legend_label** | 318 | 1077 | | **legend_title** | 17 | 19 | | **mark_label** | 42 | 219 | | **other** | 113 | 464 | | **value_label** | 52 | 726 | | **x_title** | 404 | 437 | | **xlabel** | 553 | 4091 | | **y_title** | 502 | 505 | | **ylabel** | 534 | 3944 | | **Total** | 560 | **11,520** | ### **Per-Class Performance Metrics** #### **Average Precision (AP)** | **Class** | **AP** | **Class** | **AP** | **Class** | **AP** | |----------------|---------:|----------------|---------:|---------------|---------:| | **chart_title** | 82.38 | **x_title** | 88.77 | **y_title** | 89.48 | | **xlabel** | 85.04 | **ylabel** | 86.22 | **other** | 55.14 | | **legend_label** | 84.09 | **legend_title** | 60.61 | **mark_label** | 49.31 | | **value_label** | 62.66 | | | | | #### **Average Recall (AR)** | **Class** | **AR** | **Class** | **AR** | **Class** | **AR** | |----------------|---------:|----------------|---------:|---------------|---------:| | **chart_title** | 93.16 | **x_title** | 92.31 | **y_title** | 92.32 | | **xlabel** | 88.93 | **ylabel** | 89.40 | **other** | 79.48 | | **legend_label** | 88.07 | **legend_title** | 68.42 | **mark_label** | 73.61 | | **value_label** | 68.32 | | | | | ## **Inference:** **Engine:** Tensor(RT)
**Test hardware:** Tested on all supported hardware listed in compatibility section ## **Ethical Considerations:** NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. **For more detailed information on ethical considerations for this model**, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).