---
title: "ocdrnet"
publisher: "nvidia"
type: "endpoint"
updated: "2024-08-26T16:47:16.487Z"
description: "OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively."
canonical: "https://build.nvidia.com/nvidia/ocdrnet"
---

This model card combines the relevant information for the OCRNet (recognition) and OCDNet (detection) models.

# OCRNet Model Overview

## Description <a class="anchor" name="description"></a>
The optical character recognition network (OCRNet) recognizes characters in grayscale images.

## Terms of use <a class="anchor" name="terms_of_use"></a>
License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these [licenses](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/).

## Reference(s): <a class="anchor" name="references"></a>
### Citations <a class="anchor" name="citations"></a>
- Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., ... & Lee, H. (2019). What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4715-4723).
- Zhang, Y., Gueguen, L., Zharkov, I., Zhang, P., Seifert, K., & Kadlec, B. (2017, July). Uber-text: A large-scale dataset for optical character recognition from street-level imagery. In SUNw: Scene Understanding Workshop-CVPR (Vol. 2017, p. 5).
- Singh, A., Pang, G., Toh, M., Huang, J., Galuba, W., & Hassner, T. (2021). Textocr: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8802-8812)
- Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." In: Proceedings of the 23rd international conference on Machine learning (2006)
- He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR (2015)
- Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J., & Alvarez, J. M. (2022, June). Understanding the robustness in vision transformers. In International Conference on Machine Learning (pp. 27378-27394). PMLR.
- Kuo, C. W., Ashmore, J. D., Huggins, D., & Kira, Z. (2019, January). Data-efficient graph embedding learning for PCB component detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 551-560). IEEE.

## Model Architecture: <a class="anchor" name="model_architecture"></a>
**Architecture Type:** Convolutional Neural Network (CNN) <br>
**Network Architecture:** ResNet50 <br>
**Model Version:** <br>
- **trainable_v1.0** - Pre-trained model with ResNet backbone on scene text.
- **deployable_v1.0** - Models deployable with ResNet backbone.
- **trainable_v2.0** - Pre-trained model with FAN backbone on scene text.
- **deployable_v2.0** - Model deployable with FAN backbone on scene text.
- **trainable_v2.1** - Pre-trained model with FAN backbone on PCB text.
- **deployable_v2.1** - Model deployable with FAN backbone on PCB text.

## Input: <a class="anchor" name="input"></a>
**Input Type(s):** Image <br>
**Input Format:** Gray Image <br>
**Input Parameters:** 3D <br>
**Other Properties Related to Input:** <br>
- Gray Images of 1 X 32 X 100 (C H W) for trainable_v1.0/deployable_v1.0
- Gray Images of 1 X 64 X 200 (C H W) for trainable_v2.0/trainable_v2.1/deployable_v2.0/deployable_v2.1
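
The input shapes above can be sketched as a small preprocessing helper. This is a minimal illustration and not the TAO pipeline: the function names are hypothetical, the luminance weights are the common BT.601 values, nearest-neighbour resizing is an arbitrary choice, and pixel normalization is left out because this card does not specify it.

```python
# Input shapes (C, H, W) listed in this card, keyed by model version.
OCRNET_INPUT_SHAPES = {
    "trainable_v1.0": (1, 32, 100),
    "deployable_v1.0": (1, 32, 100),
    "trainable_v2.0": (1, 64, 200),
    "deployable_v2.0": (1, 64, 200),
    "trainable_v2.1": (1, 64, 200),
    "deployable_v2.1": (1, 64, 200),
}


def to_gray(rgb_image):
    """Convert an H x W image of (R, G, B) tuples to H x W luminance values."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


def resize_nearest(gray, out_h, out_w):
    """Nearest-neighbour resize of a 2D list to out_h x out_w."""
    in_h, in_w = len(gray), len(gray[0])
    return [
        [gray[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def prepare_input(rgb_image, version):
    """Return a C x H x W nested list matching the shape listed for `version`."""
    channels, height, width = OCRNET_INPUT_SHAPES[version]
    gray = resize_nearest(to_gray(rgb_image), height, width)
    return [gray] * channels  # channels is 1 for every listed version
```

For example, `prepare_input(image, "deployable_v2.0")` yields a 1 x 64 x 200 nested list for any non-empty RGB input.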

## Output: <a class="anchor" name="output"></a>
**Output Type(s):** Sequence of characters <br>
**Output Format:** Character Id sequence: Text String(s) <br>
**Other Properties Related to Output:** None <br>

## Software Integration: <a class="anchor" name="software_integration"></a>
**Runtime(s):** NVIDIA AI Enterprise <br>
**Toolkit:** TAO Framework <br>
**Supported Hardware Platform(s):** Ampere, Jetson, Hopper, Lovelace, Pascal, Turing <br>
**Supported Operating System(s):** Linux <br>

# Training & Finetuning:

## Dataset: <a class="anchor" name="dataset"></a>

The OCRNet pretrained model was trained on the Uber-Text and TextOCR datasets. Uber-Text contains street-level images collected from car-mounted sensors, with ground truths annotated by a team of image analysts. TextOCR consists of images from the OpenImages dataset with annotated text. After collecting the original data from Uber-Text and TextOCR, we removed all text images labeled `*` in Uber-Text and kept only alphanumeric text images with a maximum length of 25 in both datasets. The final dataset contains 805007 text images for training and 24388 images for validation.
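
The filtering rule described above can be sketched as a simple predicate. This is an illustration under assumptions, not the script used to build the dataset: the function name and the `source` values are hypothetical, and "alphanumeric" is interpreted here as ASCII letters and digits.

```python
MAX_LABEL_LENGTH = 25  # maximum text length kept in both datasets


def keep_sample(label, source):
    """Return True if a text crop with this label passes the filtering rule."""
    # Uber-Text marks some crops with a `*` label; those are dropped.
    if source == "uber-text" and "*" in label:
        return False
    # Keep only non-empty ASCII alphanumeric labels of at most 25 characters.
    return label.isascii() and label.isalnum() and len(label) <= MAX_LABEL_LENGTH
```

Note that the alphanumeric check alone already rejects `*`; the explicit branch only mirrors the wording of the description.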

## Inference: <a class="anchor" name="inference"></a>
**Engine:** TensorRT <br>
**Test Hardware:** <br>
- Orin Nano
- Orin NX
- AGX Orin
- L4
- L40
- T4
- A2
- A30
- A100
- H100

# OCDNet Model Overview

## Model Overview <a class="anchor" name="model_overview"></a>

The model described in this card is an optical character detection network that detects text in images. Trainable and deployable OCDNet models are provided. Separate models are trained on the Uber-Text dataset and on the ICDAR2015 dataset.

## Terms of use <a class="anchor" name="terms_of_use"></a>
License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these [licenses](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/).

## Model Architecture <a class="anchor" name="model_architecture"></a>

This model is based on DBNet, a network architecture for real-time scene text detection with differentiable binarization. It aims to solve the problem of text localization and segmentation in natural images with complex backgrounds and various text shapes.

## Training <a class="anchor" name="training_algorithm"></a>

The training algorithm inserts the binarization operation into the segmentation network and jointly optimizes it so that the network can learn to separate foreground and background pixels more effectively. The binarization threshold is learned by minimizing the IoU loss between the predicted binary map and the ground truth binary map.
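
To make the binarization step concrete, the DBNet paper (Liao et al., cited below) replaces the hard threshold with a steep sigmoid of the difference between the probability map and the learned threshold map. The sketch below applies that function to a single pixel; the amplification factor `k = 50` is the value given in the paper, and whether this model uses the same value is an assumption.

```python
import math


def differentiable_binarization(prob, thresh, k=50.0):
    """Approximate binary value for one pixel: 1 / (1 + exp(-k * (P - T))).

    prob   -- predicted text probability P in [0, 1]
    thresh -- learned per-pixel threshold T in [0, 1]
    k      -- amplification factor (50 in the DBNet paper; assumed here)
    """
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))
```

Because the function is smooth, gradients flow through both the probability and the threshold, which is what allows the threshold to be learned jointly with the segmentation network.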

### Training Data <a class="anchor" name="training_data"></a>

The trainable models were trained on the [Uber-Text](https://s3-us-west-2.amazonaws.com/uber-common-public/ubertext/index.html) dataset and the [ICDAR2015](https://rrc.cvc.uab.es/?ch=4) dataset, respectively. The Uber-Text dataset contains street-level images collected from car-mounted sensors, with ground truths annotated by a team of image analysts. The train_4Kx4K, train_1Kx1K, val_4Kx4K, val_1Kx1K, and test_4Kx4K splits were used as the training data and test_1Kx1K as the validation data, giving 107812 images for training and 10157 images for validation. The ICDAR2015 dataset contains 1000 training images and 500 test images. The deployable models are ONNX models exported from the trainable models.

## Performance <a class="anchor" name="performance"></a>

### Evaluation Data <a class="anchor" name="evaluation_data"></a>

The OCDNet model was evaluated using the Uber-Text test dataset and ICDAR2015 test dataset.

### Methodology and KPI

The key performance indicator is the hmean of detection. The KPI for the evaluation data are reported below.

|model|test dataset|hmean|
|---|---|---|
|ocdnet_deformable_resnet18|Uber-Text|81.1%|
|ocdnet_deformable_resnet50|Uber-Text|82.2%|
|ocdnet_fan_tiny_2x_ubertext.pth|Uber-Text|86.0%|
|ocdnet_fan_tiny_2x_icdar.pth|ICDAR2015|85.3%|
|ocdnet_fan_tiny_2x_icdar_pruned.pth|ICDAR2015|84.8%|
|ocdnet_vit_pcb.pth|Internal PCB validation|69.3%|
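
The hmean in the table above is, in the usual text-detection sense, the harmonic mean of detection precision and recall. The sketch below computes it from match counts; the function name is hypothetical, and the rule for deciding when a predicted region matches a ground-truth region (for example, an IoU threshold) is not stated in this card and is left to the caller.

```python
def detection_hmean(num_matched, num_predicted, num_ground_truth):
    """Harmonic mean of precision and recall from detection match counts."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```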

### Real-time Inference Performance <a class="anchor" name="realtime_inference_performance"></a>

Inference uses FP16 precision with an input shape of `<batch>x3x640x640`. Performance was measured on OCDNet deployable models with [`trtexec`](https://github.com/NVIDIA/TensorRT/tree/master/samples/trtexec) on AGX Orin, Orin NX, Orin Nano, NVIDIA T4, NVIDIA L4, and NVIDIA A100 GPUs. The Jetson devices run at Max-N configuration for maximum system performance. The data is for inference-only performance. The end-to-end performance with streaming video data might vary slightly depending on the application's use case.

|Model|Device|precision|batch_size|FPS|
|---|---|---|---|---|
|ocdnet_deformable_resnet18|Orin Nano|FP16|32|31|
|ocdnet_deformable_resnet18|Orin NX|FP16|32|46|
|ocdnet_deformable_resnet18|AGX Orin|FP16|32|122|
|ocdnet_deformable_resnet18|T4|FP16|32|294|
|ocdnet_deformable_resnet18|L4|FP16|32|432|
|ocdnet_deformable_resnet18|A100|FP16|32|1786|
|ocdnet_fan_tiny_2x_icdar|Orin Nano|FP16|1|0.57|
|ocdnet_fan_tiny_2x_icdar|AGX Orin|FP16|1|2.24|
|ocdnet_fan_tiny_2x_icdar|T4|FP16|1|2.74|
|ocdnet_fan_tiny_2x_icdar|L4|FP16|1|5.36|
|ocdnet_fan_tiny_2x_icdar|A30|FP16|1|8.34|
|ocdnet_fan_tiny_2x_icdar|L40|FP16|1|15.01|
|ocdnet_fan_tiny_2x_icdar|A100-sxm4-80gb|FP16|1|16.61|
|ocdnet_fan_tiny_2x_icdar|H100-sxm-80gb-hbm3|FP16|1|29.13|
|ocdnet_fan_tiny_2x_icdar_pruned|Orin Nano|FP16|2|0.79|
|ocdnet_fan_tiny_2x_icdar_pruned|Orin NX|FP16|2|1.18|
|ocdnet_fan_tiny_2x_icdar_pruned|AGX Orin|FP16|2|3.08|
|ocdnet_fan_tiny_2x_icdar_pruned|A2|FP16|1|2.30|
|ocdnet_fan_tiny_2x_icdar_pruned|T4|FP16|2|3.51|
|ocdnet_fan_tiny_2x_icdar_pruned|L4|FP16|1|7.23|
|ocdnet_fan_tiny_2x_icdar_pruned|A30|FP16|2|11.37|
|ocdnet_fan_tiny_2x_icdar_pruned|L40|FP16|2|19.04|
|ocdnet_fan_tiny_2x_icdar_pruned|A100-sxm4-80gb|FP16|2|22.66|
|ocdnet_fan_tiny_2x_icdar_pruned|H100-sxm-80gb-hbm3|FP16|2|40.07|

## How to Use This Model <a class="anchor" name="how_to_use_this_model"></a>

This model needs to be used with NVIDIA Hardware and Software: The model can run on any NVIDIA GPU, including NVIDIA Jetson devices, with [TAO Toolkit](https://developer.nvidia.com/tao-toolkit), [DeepStream SDK](https://developer.nvidia.com/deepstream-sdk) or [TensorRT](https://developer.nvidia.com/tensorrt).

The primary use case for this model is to detect text on images.

There are two types of models provided (both unpruned).

- trainable
- deployable

The `trainable` models are intended for training with the user's own dataset using TAO Toolkit. This can provide high-fidelity models that are adapted to the use case. A Jupyter notebook is available as a part of the [TAO container](https://ngc.nvidia.com/catalog/containers/nvidia:tao:tao-toolkit) and can be used to re-train.

The `deployable` models share the same structure as the `trainable` model, but in `onnx` format. The `deployable` models can be deployed using TensorRT, nvOCDR, and DeepStream.

### Input

Images of C x H x W (H and W should be multiples of 32).
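
One way to satisfy the multiple-of-32 constraint is to round each spatial dimension up before padding or resizing the image. The helper names below are hypothetical, and whether to pad or to resize to the computed shape is a choice this card does not prescribe.

```python
def round_up_to_multiple(value, multiple=32):
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return ((value + multiple - 1) // multiple) * multiple


def target_shape(height, width, multiple=32):
    """Return (H, W) rounded up so that both are multiples of `multiple`."""
    return round_up_to_multiple(height, multiple), round_up_to_multiple(width, multiple)
```

A 720 x 1280 frame, for instance, maps to 736 x 1280, while 640 x 640 is left unchanged.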

### Output

BBox or polygon coordinates for each detected text in the input image
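
When a downstream step expects axis-aligned boxes but the detector returns polygons, the conversion is a min/max over the vertices. This sketch assumes a polygon is a list of `(x, y)` pairs; the actual output layout of the deployed model is not described in this card.

```python
def polygon_to_bbox(polygon):
    """Return (x_min, y_min, x_max, y_max) enclosing a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)
```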

### Instructions to Use the Model with TAO

To use these models as pretrained weights for transfer learning, use the snippet below as a template for the `model` component of the experiment spec file to train an OCDNet model. For more information on the experiment spec file, refer to the [TAO Toolkit User Guide](https://docs.nvidia.com/tao/tao-toolkit/index.html).

To use trainable_resnet18_v1.0 model:
```yaml
model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_deformable_resnet18.pth'
  backbone: deformable_resnet18
```
To use trainable_ocdnet_vit_v1.0 model:
```yaml
model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_fan_tiny_2x_icdar.pth'
  backbone: fan_tiny_8_p4_hybrid
  enlarge_feature_map_size: True
  activation_checkpoint: True
```

### Instructions to deploy the model with DeepStream

To create an end-to-end video analytics application, deploy this model with [DeepStream SDK](https://developer.nvidia.com/deepstream-sdk). DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample app.

To deploy this model with [DeepStream](https://developer.nvidia.com/deepstream-sdk), follow [these instructions](https://docs.nvidia.com/tao/tao-toolkit/text/ds_tao/nvocdr_ds.html).

## Limitations <a class="anchor" name="limitations"></a>

### Restricted Usage in Different Fields

The NVIDIA OCDNet trainable models are trained on the Uber-Text, ICDAR2015, and PCB text datasets, which cover a narrow range of scenes (Uber-Text, for example, contains street-level images only). To get better accuracy in a specific field, more data is usually required to fine-tune the pre-trained model with TAO Toolkit.

## Model versions:

- **trainable_resnet18_v1.0** - Pre-trained models with deformable-resnet18 backbone, trained on Uber-Text dataset.
- **trainable_resnet50_v1.0** - Pre-trained models with deformable-resnet50 backbone, trained on Uber-Text dataset.
- **trainable_ocdnet_vit_v1.0** - Pre-trained models with fan-tiny backbone, trained on ICDAR2015 dataset.
- **trainable_ocdnet_vit_v1.1** - Pre-trained models with fan-tiny backbone, trained on Uber-Text dataset.
- **trainable_ocdnet_vit_v1.2** - Pre-trained models with fan-tiny backbone, trained on PCB dataset.
- **trainable_ocdnet_vit_v1.3** - Pre-trained models with fan-tiny backbone, trained on ImageNet2012 dataset.
- **trainable_ocdnet_vit_v1.4** - Pruned pre-trained models with fan-tiny backbone, trained on ICDAR2015 dataset.
- **deployable_v1.0** - Model deployable with deformable-resnet backbone.
- **deployable_v2.0** - Model deployable with fan-tiny backbone, trained on ICDAR2015.
- **deployable_v2.1** - Model deployable with fan-tiny backbone, trained on Uber-Text.
- **deployable_v2.2** - Model deployable with fan-tiny backbone, trained on PCB dataset.
- **deployable_v2.3** - Pruned model deployable with fan-tiny backbone, trained on ICDAR2015.

## Reference

### Citations <a class="anchor" name="citations"></a>

- Liao M., Wan Z., Yao C., Chen K., Bai X.: Real-time Scene Text Detection with Differentiable Binarization (2020).
- Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y: Deformable convolutional networks. (2017).
- He, W., Zhang, X., Yin, F., and Liu, C.: Deep direct regression for multi-oriented scene text detection. (2017).
- Zhang, Y., Gueguen, L., Zharkov, I., Zhang, P., Seifert, K., & Kadlec, B. (2017, July). Uber-text: A large-scale dataset for optical character recognition from street-level imagery. In SUNw: Scene Understanding Workshop-CVPR (Vol. 2017, p. 5).
- Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J., & Alvarez, J. M. (2022, June). Understanding the robustness in vision transformers. In International Conference on Machine Learning (pp. 27378-27394). PMLR.
- Kuo, C. W., Ashmore, J. D., Huggins, D., & Kira, Z. (2019, January). Data-efficient graph embedding learning for PCB component detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 551-560). IEEE.

## Using TAO Pre-trained Models <a class="anchor" name="using_tlt_pretrained_models"></a>

- Get [TAO Container](https://ngc.nvidia.com/catalog/containers/nvidia:tao:tao-toolkit)
- Get other purpose-built models from the NGC model registry:
- [TrafficCamNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:trafficcamnet)
- [PeopleNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:peoplenet)
- [PeopleNet-Transformer](https://ngc.nvidia.com/catalog/models/nvidia:tao:peoplenet_transformer)
- [DashCamNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:dashcamnet)
- [FaceDetectIR](https://ngc.nvidia.com/catalog/models/nvidia:tao:facedetectir)
- [VehicleMakeNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:vehiclemakenet)
- [VehicleTypeNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:vehicletypenet)
- [PeopleSegNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:peoplesegnet)
- [PeopleSemSegNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:peoplesemsegnet)
- [License Plate Detection](https://ngc.nvidia.com/catalog/models/nvidia:tao:lpdnet)
- [License Plate Recognition](https://ngc.nvidia.com/catalog/models/nvidia:tao:lprnet)
- [Gaze Estimation](https://ngc.nvidia.com/catalog/models/nvidia:tao:gazenet)
- [Facial Landmark](https://ngc.nvidia.com/catalog/models/nvidia:tao:fpenet)
- [Heart Rate Estimation](https://ngc.nvidia.com/catalog/models/nvidia:tao:heartratenet)
- [Gesture Recognition](https://ngc.nvidia.com/catalog/models/nvidia:tao:gesturenet)
- [Emotion Recognition](https://ngc.nvidia.com/catalog/models/nvidia:tao:emotionnet)
- [FaceDetect](https://ngc.nvidia.com/catalog/models/nvidia:tao:facenet)
- [2D Body Pose Estimation](https://ngc.nvidia.com/catalog/models/nvidia:tao:bodyposenet)
- [ActionRecognitionNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:actionrecognitionnet)
- [PoseClassificationNet](https://ngc.nvidia.com/catalog/models/nvidia:tao:poseclassificationnet)
- [People ReIdentification](https://ngc.nvidia.com/catalog/models/nvidia:tao:reidentificationnet)
- [PointPillarNet](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/pointpillarnet)
- [CitySegFormer](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/citysemsegformer)
- [Retail Object Detection](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/retail_object_detection)
- [Retail Object Embedding](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/retail_object_recognition)
- [Optical Inspection](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/optical_inspection)
- [Optical Character Detection](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/ocdnet)
- [Optical Character Recognition](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/ocrnet)
- [PCB Classification](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/pcb_classification)
- [PeopleSemSegFormer](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/peoplesemsegformer)

## License <a class="anchor" name="license"></a>

The license to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these [licenses](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/).

## Technical Blogs <a class="anchor" name="technical_blogs"></a>

- [Train like a ‘pro’ without being an AI expert using TAO AutoML](https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/)
- [Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning](https://developer.nvidia.com/blog/creating-custom-ai-models-using-nvidia-tao-toolkit-with-azure-machine-learning/)
- [Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO](https://developer.nvidia.com/blog/developing-and-deploying-ai-powered-robots-with-nvidia-isaac-sim-and-nvidia-tao/)
- Learn endless ways to adapt and supercharge your AI workflows with TAO - [Whitepaper](https://developer.nvidia.com/tao-toolkit-usecases-whitepaper/1-introduction)
- [Customize Action Recognition with TAO and deploy with DeepStream](https://developer.nvidia.com/blog/developing-and-deploying-your-custom-action-recognition-application-without-any-ai-expertise-using-tao-and-deepstream/)
- Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - [Part 1](https://developer.nvidia.com/blog/training-optimizing-2d-pose-estimation-model-with-tao-toolkit-part-1)  |  [Part 2](https://developer.nvidia.com/blog/training-optimizing-2d-pose-estimation-model-with-tao-toolkit-part-2)
- Learn how to train a [real-time License plate detection and recognition app](https://developer.nvidia.com/blog/creating-a-real-time-license-plate-detection-and-recognition-app) with TAO and DeepStream.
- Model accuracy is extremely important. Learn how to achieve [state-of-the-art accuracy for classification and object detection models](https://developer.nvidia.com/blog/preparing-state-of-the-art-models-for-classification-and-object-detection-with-tao-toolkit/) using TAO.

## Suggested Reading <a class="anchor" name="suggested_reading"></a>

- More information about TAO Toolkit and pre-trained models can be found at the [NVIDIA Developer Zone](https://developer.nvidia.com/tao-toolkit).
- Read the [TAO Toolkit Quick Start Guide](https://docs.nvidia.com/tao/tao-toolkit/text/quick_start_guide/index.html) and [release notes](https://docs.nvidia.com/tao/tao-toolkit/text/release_notes.html).
- If you have any questions or feedback, please refer to the discussions on the [TAO Toolkit Developer Forums](https://forums.developer.nvidia.com/c/accelerated-computing/intelligent-video-analytics/tao-toolkit/17).
- Deploy your model on the edge using the [DeepStream SDK](https://developer.nvidia.com/deepstream-sdk).

## Ethical AI <a class="anchor" name="ethical_ai"></a>

The NVIDIA OCDNet model detects optical characters.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developers to ensure that it meets the requirements for the relevant industry and use case, that the necessary instructions and documentation are provided to understand error rates, confidence intervals, and results, and that the model is being used under the conditions and in the manner intended.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability).

## Bias

| Field | Response |
| -- | -- |
|Participation considerations from adversely impacted groups [(protected classes)](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | OCR: Not applicable <br> OCD: Not applicable |
| Measures taken to mitigate against unwanted bias: | OCR: Not Applicable <br> OCD: Not Applicable |

## Explainability

| Field | Response |
| -- | -- |
| Intended Application(s) & Domain(s): | OCR: This model is intended to be used in computer vision application for recognizing the optical characters in a scene. <br> OCD: This model is intended to be used for detecting text in images.  |
| Model Type: | OCR: This model is intended for developers who want to recognize/identify text from image data. <br> OCD: This model is intended for developers who want to detect optical characters from image data. |
| Intended Users: | OCR: This model is intended for developers who want to customize optical character recognition models. <br> OCD: This model is intended to be used for detecting text in images. |
| Output: | OCR: Sequence of characters <br> OCD: BBox or polygon coordinates for each detected text in the input image |
| Describe how the model works: | OCR: The training algorithm minimizes the connectionist temporal classification (CTC) loss between a ground truth character sequence from the image and a predicted characters sequence. Characters are decoded from the sequence output by a best-path decoding method. <br> OCD: Based on DBNet, a network architecture for real-time scene text detection, this model aims to solve the problem of text localization and segmentation in natural images with complex backgrounds and various text shapes. |
| Technical Limitations: | OCR: This model performs best on images representing its training set: Uber Text and TextOCR. Uber Text contains street view images. TextOCR has images with text in various scenes. Further fine-tuning might be required for domain-specific accuracy. <br> OCD: The NVIDIA OCDNet trainable model is trained on Uber Text, which contains street-view images only. Further fine-tuning might be required for domain-specific accuracy. |
| Verified to have met prescribed NVIDIA standards: | OCR: Yes <br> OCD: Yes |
| Performance Metrics: | OCR: Accuracy <br> OCD: hmean |
| Potential Risks: | OCR: None Known <br> OCD: None Known |
| Licensing: | [https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/) |
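
The best-path decoding mentioned in the OCR entry of the table above can be sketched as follows: take the highest-scoring class at each time step, collapse consecutive repeats, and drop the blank symbol. The function name, the use of index 0 for the CTC blank, and the `labels` list are assumptions for illustration; the real character set depends on how the model was trained.

```python
def ctc_best_path_decode(scores, labels, blank=0):
    """Greedy (best-path) CTC decoding.

    scores -- list of per-time-step score lists, one score per class
    labels -- list mapping class index to character; labels[blank] is unused
    blank  -- index of the CTC blank class (assumed to be 0 here)
    """
    # Highest-scoring class index at each time step.
    best = [max(range(len(step)), key=step.__getitem__) for step in scores]
    decoded = []
    previous = None
    for index in best:
        # Emit a character only when the class changes and is not blank.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)
```

A blank between two identical classes is what lets the decoder emit a doubled character, so the class sequence `a a - a b b - c` decodes to `aabc`.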

## Privacy

| Field | Response |
| -- | -- |
| Generatable or reverse engineerable personally-identifiable information (PII)? | OCR: Neither<br> OCD: Neither |
| If applicable, was a notice provided to the individuals prior to the collection of any personal data used? | OCR: No PII was used <br> OCD: Not applicable |
| How often is the dataset reviewed? | OCR: Before Release <br> OCD: Before Release |
| Is a mechanism in place to honor data subject right of access or deletion of personal data? |      OCR: No <br> OCD: No |
| If PII was collected for the development of the model, was it collected directly by NVIDIA? | OCR: No PII was collected <br> OCD: No PII was collected |
| If PII was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | OCR: No PII was collected <br> OCD: No PII was collected |
| If PII collected for the development of this AI model, was it minimized to only what was required? | OCR: No <br> OCD: No |
| Is there provenance for all datasets used in training? | OCR: No <br> OCD: No |
| Does data labeling (annotation, metadata) comply with privacy laws? | OCR: Yes <br> OCD: Yes|
| Is data compliant with data subject requests for data correction or removal, if such a request was made? |  OCR: No, not possible with externally-sourced data. <br> OCD: No, not possible with externally-sourced data.|
| Applicable NVIDIA Privacy Policy | [https://www.nvidia.com/en-us/about-nvidia/privacy-policy/](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |

## Safety & Security

| Field | Response |
| -- | -- |
| Model Application(s): | OCD: Detect visual regions containing text <br> OCR: Recognize the text in visual data |
| Describe the life-critical application (if present). | OCR: No physical safety impacts. <br> OCD: No physical safety impacts. |
| Use Case Restrictions:  | OCD: Images that do not contain text <br> OCR: Images that do not contain text |
| Describe access restrictions (if any): | None. This was trained on public datasets. |

## Prototype

```python
import os
import sys
import uuid
import zipfile

import requests

# NVAI endpoint for the ocdrnet NIM
nvai_url = "https://ai.api.nvidia.com/v1/cv/nvidia/ocdrnet"

# The API key is read from the NVIDIA_API_KEY environment variable
header_auth = f"Bearer {os.environ.get('NVIDIA_API_KEY', '')}"


def _upload_asset(input, description):
    """
    Uploads an asset to the NVCF API.
    :param input: The binary asset to upload
    :param description: A description of the asset
    :return: The UUID of the uploaded asset
    """
    assets_url = "https://api.nvcf.nvidia.com/v2/nvcf/assets"

    headers = {
        "Authorization": header_auth,
        "Content-Type": "application/json",
        "accept": "application/json",
    }

    s3_headers = {
        "x-amz-meta-nvcf-asset-description": description,
        "content-type": "image/jpeg",
    }

    payload = {"contentType": "image/jpeg", "description": description}

    # Step 1: request an upload URL and asset ID
    response = requests.post(assets_url, headers=headers, json=payload, timeout=30)
    response.raise_for_status()

    asset_url = response.json()["uploadUrl"]
    asset_id = response.json()["assetId"]

    # Step 2: upload the image bytes to the returned URL
    response = requests.put(
        asset_url,
        data=input,
        headers=s3_headers,
        timeout=300,
    )
    response.raise_for_status()
    return uuid.UUID(asset_id)


if __name__ == "__main__":
    """Uploads an image of your choosing to the NVCF API and sends a
    request to the Optical character detection and recognition model.
    The response is saved to a local directory.

    Note: You must set up an environment variable, NVIDIA_API_KEY.
    """

    if len(sys.argv) != 3:
        print("Usage: python test.py <image> <output_dir>")
        sys.exit(1)

    # Close the image file once the upload has finished
    with open(sys.argv[1], "rb") as image_file:
        asset_id = _upload_asset(image_file, "Input Image")

    inputs = {"image": f"{asset_id}", "render_label": False}

    asset_list = f"{asset_id}"

    headers = {
        "Content-Type": "application/json",
        "NVCF-INPUT-ASSET-REFERENCES": asset_list,
        "NVCF-FUNCTION-ASSET-IDS": asset_list,
        "Authorization": header_auth,
    }

    response = requests.post(nvai_url, headers=headers, json=inputs, timeout=300)
    response.raise_for_status()

    # The response body is a zip archive; save it, then extract it
    with open(f"{sys.argv[2]}.zip", "wb") as out:
        out.write(response.content)

    with zipfile.ZipFile(f"{sys.argv[2]}.zip", "r") as z:
        z.extractall(sys.argv[2])

    print(f"Output saved to {sys.argv[2]}")
    print(os.listdir(sys.argv[2]))
```

```javascript
const fs = require('fs');
const decompress = require("decompress");

const nvai_url = "https://ai.api.nvidia.com/v1/cv/nvidia/ocdrnet";
// The API key is read from the NVIDIA_API_KEY environment variable
const header_auth = `Bearer ${process.env.NVIDIA_API_KEY}`;

async function _upload_asset(input, description) {
  const assets_url = "https://api.nvcf.nvidia.com/v2/nvcf/assets";

  const headers = {
    "Authorization": header_auth,
    "Content-Type": "application/json",
    "accept": "application/json",
  };

  const s3_headers = {
    "x-amz-meta-nvcf-asset-description": description,
    "content-type": "image/jpeg",
  };

  const payload = {
    "contentType": "image/jpeg",
    "description": description
  };

  // Request an upload URL and asset ID
  const response = await fetch(
    assets_url, { method: 'POST', body: JSON.stringify(payload), headers: headers }
  );

  const data = await response.json();

  const asset_url = data["uploadUrl"];
  const asset_id = data["assetId"];

  const fileData = fs.readFileSync(input);

  // Upload the image bytes to the returned URL
  await fetch(
    asset_url,
    { method: 'PUT', body: fileData, headers: s3_headers }
  );

  return asset_id.toString();
}

(async () => {
  if (process.argv.length !== 4) {
    console.log("Usage: node test.js <image> <output_dir>");
    process.exit(1);
  }

  // Upload specified user asset
  const asset_id = await _upload_asset(`${process.argv[2]}`, "Input Image");

  // Metadata for the request
  // 'render_label' can be set to false if the user just wants the boxes without any labels
  const inputs = { "image": asset_id, "render_label": false };
  const asset_list = asset_id;
  const headers = {
    "Content-Type": "application/json",
    "NVCF-INPUT-ASSET-REFERENCES": asset_list,
    "NVCF-FUNCTION-ASSET-IDS": asset_list,
    "Authorization": header_auth
  };

  // Make the request to nvcf
  const response = await fetch(nvai_url, {
    method: 'POST', body: JSON.stringify(inputs), headers: headers
  });

  // Gather the binary response data
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(arrayBuffer);

  const zipname = `${process.argv[3]}.zip`;
  fs.writeFileSync(zipname, buffer);

  // Unzip the response and wait for extraction to finish
  await decompress(zipname, process.argv[3]);

  // Log the output directory and its contents
  console.log(`Response saved to ${process.argv[3]}`);
  console.log(fs.readdirSync(process.argv[3]));
})();
```

```bash
#!/bin/bash

set -e

# Check arguments
if [ "$#" -ne 2 ]; then
  echo "Usage: ./test.sh <image> <output_dir>"
  exit 1
fi

header_auth="Bearer $NVIDIA_API_KEY"
assets_url="https://api.nvcf.nvidia.com/v2/nvcf/assets"

nvai_url="https://ai.api.nvidia.com/v1/cv/nvidia/ocdrnet"

function upload_asset {
  local input=$1
  local description=$2

  # Request an upload URL and asset ID
  response=$(curl -s -X POST "$assets_url" \
    -H "Authorization: ${header_auth}" \
    -H "Content-Type: application/json" \
    -H "accept: application/json" \
    -d '{"contentType": "image/jpeg", "description": "'"${description}"'"}')

  asset_url=$(echo "${response}" | jq -r .uploadUrl)
  asset_id=$(echo "${response}" | jq -r .assetId)

  # Upload the image bytes to the returned URL
  curl -s -X PUT -H "x-amz-meta-nvcf-asset-description: ${description}" -H "content-type: image/jpeg" \
    --data-binary "@${input}" "${asset_url}"

  echo "${asset_id}"
}

asset_id=$(upload_asset "$1" "Input Image")
inputs='{"image": "'"${asset_id}"'", "render_label": false}'
asset_list="${asset_id}"

# The response will include a location header that curl doesn't trivially
# redirect to due to its complexity. We need to extract the location header
# and then download the file from that location.
location_command="curl -D - -s -X POST $nvai_url \
-H \"Content-Type: application/json\" \
-H \"NVCF-INPUT-ASSET-REFERENCES: ${asset_list}\" \
-H \"NVCF-FUNCTION-ASSET-IDS: ${asset_list}\" \
-H \"Authorization: ${header_auth}\" \
-d '${inputs}' | grep location | awk '{print \$2}'"

# Remove any newlines, carriage returns, spaces, quotes, and commas from the location header
location=$(eval ${location_command} | tr -d '\n' | tr -d '\r' | tr -d ' ' | tr -d '"' | tr -d ',')

# The download command will download the file from the location header
download_command="curl -s '${location}' > $2.zip"

# Download the .zip file
response=$(eval ${download_command})

# Unzip the file
unzip -q "$2.zip" -d "$2"

echo "Response saved to $2.zip"
echo $(ls $2)
```