nvidia/nv-yolox-page-elements-v1

PREVIEW

Model for object detection, fine-tuned to detect charts, tables, and titles in documents.

Model Overview

Description

YOLOX is an anchor-free variant of the YOLO (You Only Look Once) family of one-shot object detectors, developed by Megvii Technology, with a simpler design, better performance, and a less restrictive license. This model is a YOLOX-L checkpoint fine-tuned on roughly 26,000 images from the Digital Corpora dataset, with annotations generated by Azure AI Document Intelligence. The model is trained to detect tables, charts, and titles in documents.

This model is ready for commercial use.

License/Terms of use

Use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement and the Apache 2.0 License.

Model Architecture

Architecture Type: YOLOX
Network Architecture: DarkNet53 backbone + FPN decoupled head (one 1x1 convolution followed by two parallel 3x3 convolution branches, one for classification and one for bounding-box prediction)

YOLOX is a single-stage object detector that improves on YOLOv3. The model is fine-tuned to detect three classes of objects in documents: table, chart, and title. A chart is defined as a bar chart, line chart, or pie chart. Titles can be page titles, section titles, or table/chart titles.

Model Version(s)

Short name: YOLOX Document Structure Detection

Intended use

The YOLOX model is suitable for users who want to extract tables, titles, and charts from documents. It can be used for document analysis, document understanding, and document processing. The goal is to extract structural elements (tables and charts) from the page so that vision models can then be applied for information extraction.

Technical Details

Input

Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: Two Dimensional (2D)
Other Properties Related to Input: Image size resized to (1024, 1024)
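The reference YOLOX preprocessing resizes the longer side to the target resolution and pads the remainder with gray (value 114) rather than stretching the page. A minimal, dependency-free sketch of that letterbox step, using nearest-neighbor sampling (the production NIM preprocessing may differ in interpolation and padding details):

```python
import numpy as np

def letterbox_resize(image: np.ndarray, size: int = 1024, pad_value: int = 114) -> np.ndarray:
    """Resize an RGB image (H, W, 3) so its longer side equals `size`,
    then pad to a square (size, size, 3) array, top-left aligned."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor index maps for the resized grid.
    rows = np.clip((np.arange(new_h) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / scale).astype(int), 0, w - 1)
    resized = image[rows][:, cols]
    # Pad the shorter side with gray, as in the YOLOX reference code.
    canvas = np.full((size, size, 3), pad_value, dtype=image.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas

page = np.zeros((2200, 1700, 3), dtype=np.uint8)  # synthetic letter-size page scan
model_input = letterbox_resize(page)
print(model_input.shape)  # (1024, 1024, 3)
```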

Output

Output Type(s): Array
Output Format: Dictionary of dictionaries containing np.ndarray values. The outer dictionary has one entry per sample (page); each entry holds a list of dictionaries with the bounding boxes, object type, and confidence for that page.
Output Parameters: n/a
Other Properties Related to Output: Output contains the bounding box, detection confidence, and object class (chart, table, title). Thresholds used for NMS: conf_thresh = 0.01; iou_thresh = 0.5; max_per_img = 100; min_per_img = 0
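The nested output structure and the confidence/count thresholds above can be sketched as follows. The key names ("bboxes", "type", "confidence") mirror the description above, but the exact payload shape is an assumption for illustration:

```python
import numpy as np

# Hypothetical raw output for one page; boxes are illustrative normalized coordinates.
raw_output = {
    "page_0": [
        {"bboxes": np.array([0.10, 0.20, 0.60, 0.50]), "type": "table", "confidence": 0.94},
        {"bboxes": np.array([0.10, 0.05, 0.90, 0.12]), "type": "title", "confidence": 0.88},
        {"bboxes": np.array([0.50, 0.60, 0.80, 0.90]), "type": "chart", "confidence": 0.004},
    ],
}

CONF_THRESH = 0.01  # matches conf_thresh above
MAX_PER_IMG = 100   # matches max_per_img above

def filter_detections(page_dets):
    """Drop detections below the confidence threshold and keep at most
    MAX_PER_IMG boxes per page, highest confidence first."""
    kept = [d for d in page_dets if d["confidence"] >= CONF_THRESH]
    kept.sort(key=lambda d: d["confidence"], reverse=True)
    return kept[:MAX_PER_IMG]

filtered = {page: filter_detections(dets) for page, dets in raw_output.items()}
print([d["type"] for d in filtered["page_0"]])  # ['table', 'title']
```

The low-confidence chart detection (0.004) falls below conf_thresh = 0.01 and is discarded; the surviving boxes are returned in descending confidence order.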

Software Integration

Runtime: NeMo Retriever YOLOX Structured Images NIM
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Ada Lovelace

Supported Operating System(s):

  • Linux

Model Version(s):

  • nvidia/nv-yolox-structured-images-v1

Training Dataset & Evaluation

Training Dataset

Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated

Pretraining: COCO train2017

Fine-tuning (by NVIDIA): 25,832 images from the Digital Corpora dataset, with annotations from Azure AI Document Intelligence.

Number of bounding boxes per class: 30,099 tables, 34,369 titles, and 8,363 charts. The Document Intelligence layout model was used with the 2024-02-29-preview API version.

Evaluation Results

The primary evaluation set is a held-out subset of the Azure labels and Digital Corpora images. Number of bounding boxes per class: 1,704 tables, 1,906 titles, and 367 charts. mAP (mean average precision) was used as the evaluation metric.

Data Collection Method by dataset: Automated
Labeling Method by dataset: Automated, Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): We evaluated with Azure labels from held-out pages, as well as by manual inspection of public PDFs and PowerPoint slides.
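Both the mAP evaluation above and the NMS iou_thresh = 0.5 in the output section rest on the same intersection-over-union measure between two boxes. A minimal IoU helper, assuming (x1, y1, x2, y2) box coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted table box overlapping half of a ground-truth box:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333... — below the 0.5 match threshold
```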

Inference:

Engine: TensorRT
Test Hardware: See the Support Matrix in the NIM documentation.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Model Card++

Bias

Participation considerations from adversely impacted groups (protected classes) in model design and testing: None
Measures taken to mitigate against unwanted bias: None

Explainability

Intended Application & Domain: Document Understanding
Model Type: Object Detection
Intended User: Enterprise developers who need to organize internal documentation
Output: Array of floating-point numbers (with localization information)
Describe how the model works: The model detects charts, tables, and titles in an image.
Verified to have met prescribed quality standards: Yes
Performance Metrics: Accuracy, Throughput, and Latency
Potential Known Risks: The model is not guaranteed to extract all entities in an image.
Licensing & Terms of Use: NVIDIA AI Foundation Models Community License Agreement and the Apache 2.0 License.
Technical Limitations: The model may not generalize to document types not commonly found on the web.

Privacy

Generatable or reverse engineerable personally-identifiable information (PII)? Neither
Was consent obtained for any personal data used? Not Applicable
Personal data used to create this model? None
How often is the dataset reviewed? Before Every Release
Is a mechanism in place to honor data subject right of access or deletion of personal data? No
If personal data was collected for the development of the model, was it collected directly by NVIDIA? Not Applicable
If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? Not Applicable
If personal data was collected for the development of this AI model, was it minimized to only what was required? Not Applicable
Is there provenance for all datasets used in training? Yes
Does data labeling (annotation, metadata) comply with privacy laws? Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? No, not possible with externally-sourced data.

Safety and Security

Model Application(s): Object detection of document page elements (tables, charts, and titles)
Describe the physical safety impact (if present): Not Applicable
Use Case Restrictions: Commercial use; abide by the NVIDIA AI Foundation Models Community License Agreement.
Model and dataset restrictions: The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.