
nemoretriever-table-structure-v1


Model for object detection, fine-tuned to detect the structure of tables (cells, rows, and columns) in documents.

chart detection · object detection · table detection · data ingestion · nemo retriever

Model Overview

Description

The NeMo Retriever Table Structure v1 model is a specialized object detection model designed to identify and extract the structure of tables in images. Based on YOLOX, an anchor-free version of YOLO (You Only Look Once), this model combines a simpler architecture with enhanced performance. While the underlying technology builds upon work from Megvii Technology, we developed our own base model through complete retraining rather than using pre-trained weights.

The model excels at detecting and localizing the fundamental structural elements within tables. Through careful fine-tuning, it can accurately identify and delineate three key components within tables:

  1. Individual cells (including merged cells)
  2. Rows
  3. Columns

This specialized focus on table structure enables precise decomposition of complex tables into their constituent parts, forming the foundation for downstream retrieval tasks. The model helps convert tables into markdown format, which can improve retrieval accuracy.
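To make the markdown conversion concrete, here is an illustrative sketch (not the NIM's actual post-processing): the intersection of a detected row box and column box approximates a grid cell, and OCR'd text for each cell region is joined into markdown. The detection dict shape follows the Output section below; `get_text` is a hypothetical OCR lookup, since text recognition is handled by a separate component of the pipeline.

```python
# Illustrative sketch: assemble a markdown table from row/column detections.
# Each detection is a dict with "bbox" (x1, y1, x2, y2), "class", and
# "confidence", per the Output section. get_text() is a hypothetical OCR
# lookup that returns the text inside a given region.

def table_to_markdown(detections, get_text):
    rows = sorted([d["bbox"] for d in detections if d["class"] == "row"],
                  key=lambda b: b[1])   # top-to-bottom
    cols = sorted([d["bbox"] for d in detections if d["class"] == "column"],
                  key=lambda b: b[0])   # left-to-right

    lines = []
    for i, (_, ry1, _, ry2) in enumerate(rows):
        # A row box intersected with each column box approximates one grid cell.
        cells = [get_text((cx1, ry1, cx2, ry2)) for cx1, _, cx2, _ in cols]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:  # markdown header separator after the first row
            lines.append("|" + " --- |" * len(cols))
    return "\n".join(lines)
```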

This model is ready for commercial use and is a part of the NVIDIA NeMo Retriever family of NIM microservices specifically for object detection and multimodal extraction of enterprise documents.

License/Terms of Use

The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

Model Architecture

Architecture Type: YOLOX
Network Architecture: DarkNet53 backbone + FPN with a decoupled head (one 1x1 convolution followed by two parallel 3x3 convolutions, one for classification and one for bounding-box prediction). The YOLOX architecture is a single-stage object detector that improves on YOLOv3.
Deployment Geography: Global
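As an illustration of the decoupled-head idea (not NVIDIA's actual implementation), a minimal PyTorch sketch might look like the following, omitting details such as the objectness branch and repeated convolutions in the real YOLOX head:

```python
# Minimal sketch of a YOLOX-style decoupled head: a 1x1 convolution stem,
# then two parallel 3x3 branches, one for class scores and one for box
# regression. The real YOLOX head also carries an objectness branch.
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, 256, kernel_size=1)
        self.cls_branch = nn.Conv2d(256, num_classes, kernel_size=3, padding=1)
        self.reg_branch = nn.Conv2d(256, 4, kernel_size=3, padding=1)  # x, y, w, h

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)
```

In the YOLOX paper, separating the classification and regression branches was reported to speed up convergence and improve accuracy over the coupled YOLOv3 head.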

Use Case:
This model specializes in analyzing images containing tables by:

  • Detecting and extracting table structure elements (rows, columns, and cells)
  • Providing precise location information for each detected element
  • Supporting downstream tasks like table analysis and data extraction

The model is designed to work in conjunction with OCR (Optical Character Recognition) systems to:

  1. Identify the structural layout of tables
  2. Preserve the relationships between table elements
  3. Enable accurate extraction of tabular data from images
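A hedged sketch of step 3, assuming the OCR engine returns word boxes in the same coordinate space as the detections: each word is assigned to the cell whose box contains the word's center. The `ocr_words` format is a placeholder for illustration, not a defined interface of this NIM.

```python
# Hedged sketch of OCR fusion: assign each OCR word to the detected cell
# whose box contains the word's center. ocr_words is a hypothetical list
# of (text, (x1, y1, x2, y2)) pairs from any OCR engine.

def assign_words_to_cells(cells, ocr_words):
    """cells: list of (x1, y1, x2, y2) cell boxes from the model."""
    assignments = {i: [] for i in range(len(cells))}
    for text, (x1, y1, x2, y2) in ocr_words:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        for i, (cx1, cy1, cx2, cy2) in enumerate(cells):
            if cx1 <= cx <= cx2 and cy1 <= cy <= cy2:
                assignments[i].append(text)
                break
    return {i: " ".join(words) for i, words in assignments.items()}
```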

Ideal for:

  • Document processing systems
  • Automated data extraction pipelines
  • Digital content management solutions
  • Business intelligence applications

Release Date: 2025-03-17

Technical Details

Input

Input type(s): Image
Input format(s): Red, Green, Blue (RGB)
Input parameters: Two Dimensional (2D)
Other properties related to input: Image size resized to (1024, 1024)
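For illustration, a minimal preprocessing sketch matching this input contract. Whether the service letterboxes or plainly resizes internally is an assumption; when calling the NIM you typically send the raw image and let the service handle resizing.

```python
# Minimal preprocessing sketch for the stated input contract:
# an RGB image resized to 1024x1024. The exact resize policy (plain
# resize vs. letterboxing) is an assumption of this sketch.
from PIL import Image
import numpy as np

def preprocess(path):
    img = Image.open(path).convert("RGB")   # enforce RGB channel order
    img = img.resize((1024, 1024))          # model input size
    return np.asarray(img)                  # HxWx3 uint8 array
```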

Output

Output Type(s): Array
Output Format: A dictionary of dictionaries containing np.ndarray objects. The outer dictionary holds one entry per sample (table); each inner dictionary contains a list of dictionaries with the bounding box, class, and confidence for each detection in that table.
Output Parameters: 1D
Other Properties Related to Output: Each detection contains a bounding box, a detection confidence, and an object class (cell, row, or column). Thresholds used for non-maximum suppression: conf_thresh = 0.01; iou_thresh = 0.25.
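A sketch of this post-processing with the stated thresholds, assuming the detection dicts described above. In practice NMS is applied per class (cell, row, column); this sketch is class-agnostic for brevity.

```python
# Sketch of the stated post-processing: confidence filtering followed by
# greedy non-maximum suppression with the thresholds from this card.

CONF_THRESH = 0.01
IOU_THRESH = 0.25

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(detections):
    """detections: list of dicts with 'bbox' and 'confidence' keys."""
    kept = []
    for det in sorted(detections, key=lambda d: -d["confidence"]):
        if det["confidence"] < CONF_THRESH:
            continue  # drop low-confidence detections first
        # keep a box only if it does not overlap an already-kept box too much
        if all(iou(det["bbox"], k["bbox"]) < IOU_THRESH for k in kept):
            kept.append(det)
    return kept
```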

Software Integration

Runtime: NeMo Retriever Table Structure v1 NIM
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Ada Lovelace
Supported Operating System(s): Linux
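A hypothetical client sketch for a locally deployed NIM follows. The port, route, and payload shape are assumptions for illustration only; consult the API Reference above for the actual contract.

```python
# Hypothetical client sketch for a locally running NIM container.
# The endpoint path and JSON payload below are ASSUMPTIONS, not the
# documented API; check the API Reference for the real contract.
import base64
import requests

with open("table.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/v1/infer",  # assumed local NIM endpoint
    json={"input": [{"type": "image_url",
                     "url": f"data:image/png;base64,{image_b64}"}]},
)
resp.raise_for_status()
print(resp.json())  # expected: per-table detections (bbox, class, confidence)
```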

Model Version(s):

  • nemoretriever-table-structure-v1

Training Dataset & Evaluation

Training Dataset

Data collection method by dataset: Automated
Labeling method by dataset: Automated
Pretraining: COCO train2017
Fine-tuning (by NVIDIA): 23,977 images from the Digital Corpora dataset, with annotations from Azure AI Document Intelligence. Number of bounding boxes per class: 1,828,978 cells, 134,089 columns, and 316,901 rows. The layout model of Document Intelligence was used with the 2024-02-29-preview API version.

Evaluation Results

The primary evaluation set: 2,459 Digital Corpora images with Azure labels. Number of bounding boxes per class: 200,840 cells, 13,670 columns, and 34,575 rows. mAP (mean average precision) was used as the evaluation metric.
Data collection method by dataset: Hybrid: Automated, Human
Labeling method by dataset: Hybrid: Automated, Human
Properties: We evaluated with Azure labels from manually selected pages, as well as by manual inspection of public PDFs and PowerPoint slides.

Per-class Performance Metrics:

Class      Average Precision (%)   Average Recall (%)
cell       58.365                  60.647
row        76.992                  81.115
column     85.293                  87.434
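To clarify how such per-class figures are computed, here is a hedged sketch of the standard matching rule: a prediction counts as a true positive if it overlaps an unmatched ground-truth box of the same class above an IoU threshold. The 0.5 threshold is an assumption for illustration (mAP typically averages AP over several IoU thresholds); `iou_fn` can be the IoU helper from the post-processing sketch above.

```python
# Hedged sketch of single-threshold precision/recall for one image.
# preds and gts are lists of dicts with "bbox", "class" (and, for preds,
# "confidence"). The 0.5 IoU threshold is an illustrative assumption.
def precision_recall(preds, gts, iou_fn, thresh=0.5):
    matched, tp = set(), 0
    for p in sorted(preds, key=lambda d: -d["confidence"]):
        for i, g in enumerate(gts):
            if i in matched or g["class"] != p["class"]:
                continue
            if iou_fn(p["bbox"], g["bbox"]) >= thresh:
                matched.add(i)  # each ground-truth box matches at most once
                tp += 1
                break
    return tp / max(len(preds), 1), tp / max(len(gts), 1)
```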

Inference:

Engine: TensorRT.
Test hardware: See the Support Matrix in the NIM documentation.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.