nvidia/nv-yolox-page-elements-v1

PREVIEW

Model for object detection, fine-tuned to detect charts, tables, and titles in documents.

Model Overview

Description

YOLOX is an anchor-free variant of the YOLO (You Only Look Once) family of one-shot object detectors, developed by Megvii Technology, with a simpler design, better performance, and a less restrictive license. This model is a YOLOX-L checkpoint fine-tuned on roughly 26,000 images from the Digital Corpora dataset, with annotations generated by Azure AI Document Intelligence. The model is trained to detect tables, charts, and titles in documents.

This model is ready for commercial use.

License/Terms of use

Use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement and the Apache 2.0 License.

Model Architecture

Architecture Type: YOLOX
Network Architecture: DarkNet53 backbone + FPN decoupled head (one 1x1 convolution followed by two parallel 3x3 convolution branches, one for classification and one for bounding-box prediction)

YOLOX is a single-stage object detector that improves on YOLOv3. The model is fine-tuned to detect three classes of objects in documents: table, chart, and title. A chart is defined as a bar chart, line chart, or pie chart. Titles can be page titles, section titles, or table/chart titles.

Model Version(s)

Short name: YOLOX Document Structure Detection

Intended use

The YOLOX model is suitable for users who want to extract tables, titles, and charts from documents. It can be used for document analysis, document understanding, and document processing. The goal is to extract structural elements (tables and charts) from the page so that vision models can then be applied for information extraction.

Technical Details

Input

Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: Two Dimensional (2D)
Other Properties Related to Input: Image size resized to (1024, 1024)
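The reference YOLOX preprocessing resizes the longer side to the target resolution and pads the remainder with gray (value 114) rather than stretching the page. A minimal, dependency-free sketch of that letterbox step, using nearest-neighbor sampling (the production NIM preprocessing may differ in interpolation and padding details):

```python
import numpy as np

def letterbox_resize(image: np.ndarray, size: int = 1024, pad_value: int = 114) -> np.ndarray:
    """Resize an RGB image (H, W, 3) so its longer side equals `size`,
    then pad to a square (size, size, 3) array, top-left aligned."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor index maps for the resized grid.
    rows = np.clip((np.arange(new_h) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / scale).astype(int), 0, w - 1)
    resized = image[rows][:, cols]
    # Pad the shorter side with gray, as in the YOLOX reference code.
    canvas = np.full((size, size, 3), pad_value, dtype=image.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas

page = np.zeros((2200, 1700, 3), dtype=np.uint8)  # synthetic letter-size page scan
model_input = letterbox_resize(page)
print(model_input.shape)  # (1024, 1024, 3)
```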

Output

Output Type(s): Array
Output Format: Dictionary of dictionaries containing np.ndarray values. The outer dictionary has one entry per sample (page); each entry holds a list of dictionaries with the bounding boxes, object type, and confidence for that page.
Output Parameters: n/a
Other Properties Related to Output: Output contains the bounding box, detection confidence, and object class (chart, table, title). Thresholds used for NMS: conf_thresh = 0.01; iou_thresh = 0.5; max_per_img = 100; min_per_img = 0
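The nested output structure and the confidence/count thresholds above can be sketched as follows. The key names ("bboxes", "type", "confidence") mirror the description above, but the exact payload shape is an assumption for illustration:

```python
import numpy as np

# Hypothetical raw output for one page; boxes are illustrative normalized coordinates.
raw_output = {
    "page_0": [
        {"bboxes": np.array([0.10, 0.20, 0.60, 0.50]), "type": "table", "confidence": 0.94},
        {"bboxes": np.array([0.10, 0.05, 0.90, 0.12]), "type": "title", "confidence": 0.88},
        {"bboxes": np.array([0.50, 0.60, 0.80, 0.90]), "type": "chart", "confidence": 0.004},
    ],
}

CONF_THRESH = 0.01  # matches conf_thresh above
MAX_PER_IMG = 100   # matches max_per_img above

def filter_detections(page_dets):
    """Drop detections below the confidence threshold and keep at most
    MAX_PER_IMG boxes per page, highest confidence first."""
    kept = [d for d in page_dets if d["confidence"] >= CONF_THRESH]
    kept.sort(key=lambda d: d["confidence"], reverse=True)
    return kept[:MAX_PER_IMG]

filtered = {page: filter_detections(dets) for page, dets in raw_output.items()}
print([d["type"] for d in filtered["page_0"]])  # ['table', 'title']
```

The low-confidence chart detection (0.004) falls below conf_thresh = 0.01 and is discarded; the surviving boxes are returned in descending confidence order.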

Software Integration

Runtime: NeMo Retriever YOLOX Structured Images NIM
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Ada Lovelace

Supported Operating System(s):

  • Linux

Model Version(s):

  • nvidia/nv-yolox-structured-images-v1

Training Dataset & Evaluation

Training Dataset

Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated

Pretraining: COCO train2017

Fine-tuning (by NVIDIA): 25,832 images from the Digital Corpora dataset, with annotations from Azure AI Document Intelligence.

Number of bounding boxes per class: 30,099 tables, 34,369 titles, and 8,363 charts. The Document Intelligence layout model was used with the 2024-02-29-preview API version.

Evaluation Results

The primary evaluation set is a held-out subset of the Azure labels and Digital Corpora images. Number of bounding boxes per class: 1,704 tables, 1,906 titles, and 367 charts. mAP (mean average precision) was used as the evaluation metric.

Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated, Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): We evaluated with Azure labels from held-out pages, as well as by manual inspection of public PDFs and PowerPoint slides.
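Both the mAP evaluation above and the NMS iou_thresh = 0.5 in the output section rest on the same intersection-over-union measure between two boxes. A minimal IoU helper, assuming (x1, y1, x2, y2) box coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted table box overlapping half of a ground-truth box:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333... — below the 0.5 match threshold
```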

Inference:

Engine: TensorRT
Test Hardware: See the Support Matrix in the NIM documentation.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Model Card++

Bias

Participation considerations from adversely impacted groups (protected classes) in model design and testing: None
Measures taken to mitigate against unwanted bias: None

Explainability

Intended Application & Domain: Document Understanding
Model Type: Object Detection
Intended User: Enterprise developers who need to organize internal documentation
Output: Array of floating-point numbers (with localization information)
Describe how the model works: The model detects charts, tables, and titles in an image.
Verified to have met prescribed quality standards: Yes
Performance Metrics: Accuracy, Throughput, and Latency
Potential Known Risks: The model is not guaranteed to extract all entities in an image.
Licensing & Terms of Use: NVIDIA AI Foundation Models Community License Agreement and the Apache 2.0 License.
Technical Limitations: The model may not generalize to document types not commonly found on the web.

Privacy

Generatable or reverse engineerable personally-identifiable information (PII)? Neither
Was consent obtained for any personal data used? Not Applicable
Personal data used to create this model? None
How often is the dataset reviewed? Before Every Release
Is a mechanism in place to honor data subject right of access or deletion of personal data? No
If personal data was collected for the development of the model, was it collected directly by NVIDIA? Not Applicable
If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? Not Applicable
If personal data was collected for the development of this AI model, was it minimized to only what was required? Not Applicable
Is there provenance for all datasets used in training? Yes
Does data labeling (annotation, metadata) comply with privacy laws? Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? No, not possible with externally-sourced data.

Safety and Security

Model Application(s): Object detection of document page elements (tables, charts, and titles)
Describe the physical safety impact (if present): Not Applicable
Use Case Restrictions: Commercial use; abide by the NVIDIA AI Foundation Models Community License Agreement.
Model and dataset restrictions: The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.