Try NVIDIA NIM APIs

Use to flash a promoted BSP image to a Jetson DUT in RCM mode via flash.sh or l4t_initrd_flash.sh. Do NOT use for BSP customization, image promotion, or carrier derivation.

Skill

Developer

920

28d

Two-step image grounding pipeline: extracts referring expressions from (image, caption) pairs and grounds them to pixel-space bounding boxes via a VLM. Use when the user wants to ground captions to bboxes, generate phrase-grounded annotations, auto-label

Skill

Developer

1mo

Use after jetson-flash-image to run static BSP checks, on-target smoke/regression tests on a flashed DUT, or both. Not for build or flash steps. Triggers: validate bsp, on-target validation.

Skill

Developer

913

28d

PyTorch-based TAO image classification. Supports a wide range of backbones (FAN, EfficientNet, ResNet, etc.) with distillation and quantization for deployment. Use when training, evaluating, distilling, quantizing, exporting, or running inference for a TA

Skill

Developer

1mo

Extract Jetson Linux + sample-rootfs tarballs and run apply_binaries.sh for the active target, then record bsp_image in the profile. Use after jetson-init-target; not for source-tree setup.

Skill

Developer

916

28d

NVIDIA

Downloadable

qwen-image-edit-nvpcb-ovsl2sl

An image edit model specialized for Omniverse synthetic to photographic solder-light style captured at NVIDIA PCB inspection stations

Model

Synthetic Data Generation

17d

Use when the user wants to orchestrate defect image generation with NVIDIA Cosmos AnomalyGen (Cosmos-Predict2-derived) on OSMO for PCBA, metal surface, and glass inspection. The Day 0 path handles cold-start with USD-to-ROI, image-edit augmentation, and A

Skill

Developer

1mo

DGX Station

45 MIN

Image & Video Generation with ComfyUI

Generate images and videos with FLUX, Wan 2.1, HunyuanVideo, and Cosmos on DGX Station

Playbook

Image Generation

1mo

Runs the DEFT embed-then-mine workflow for VCN AOI iterations — embeds the gap-analysis target parquet, embeds a source pool, and mines nearest-neighbour source images for downstream augmentation. Use as the immediate next step after `tao-route-visual-cha

Skill

Developer

1mo

DGX Spark

1 HR

FLUX.1 Dreambooth LoRA Fine-tuning

Fine-tune FLUX.1-dev 12B model using Dreambooth LoRA for custom image generation

Playbook

Image Generation

9mo

llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

Qwen

Downloadable

Free Endpoint

qwen3.5-397b-a17b

Next-gen Qwen 3.5 VLM (400B MoE) brings advanced vision, chat, RAG, and agentic capabilities.

Model

MoE

16M

5mo

llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

NVIDIA

Downloadable

Free Endpoint

llama-3.1-nemotron-nano-vl-8b-v1

Multimodal vision-language model that understands text and images and creates informative responses

Model

doc intelligence

14M

Mistral AI

Downloadable

Free Endpoint

mistral-small-4-119b-2603

Hybrid MoE model unifying instruct, reasoning, and coding with multimodal input and 256k context

Model

code generation

11M

4mo

NVIDIA

Free Endpoint

cosmos3-nano

Generates physics-aware videos from text prompts or an image prompt for physical AI development.

Model

autonomous vehicles

1mo

DGX Spark

1 HR

Vision-Language Model Fine-tuning

Fine-tune Vision-Language Models for image and video understanding tasks using Qwen2.5-VL and InternVL3

Playbook

DGX

9mo

Black-forest-labs

Downloadable

flux.2-klein-4b

FLUX.2-klein-4B is a distilled image generation and editing model, producing outputs at lightning speed

Model

image editing

271K

4mo

Microsoft

Downloadable

TRELLIS

MSFT TRELLIS is a 3D AI model that generates high-quality 3D assets from text or image inputs.

Model

text-to-3d

10mo

Thinkingmachines

Downloadable

Free Endpoint

inkling

Inkling is a multimodal (text + image) reasoning model from Thinking Machines — a Mamba-hybrid, 256-expert Mixture-of-Experts architecture with tool use and switchable reasoning.

Model

text-to-text

Google

Free Endpoint

paligemma

Vision-language model adept at comprehending text and visual inputs to produce informative responses

Model

image

12K