---
title: "nemotron-nano-12b-v2-vl"
publisher: "nvidia"
type: "endpoint"
updated: "2025-10-28T18:23:35.544Z"
description: "Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities."
canonical: "https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl"
---

# Model Overview
### Description:
NVIDIA Nemotron Nano 12B v2 VL model enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities.
<br>

This model is ready for commercial use. <br>

### License/Terms of Use
Governing Terms: The trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

### Deployment Geography:
Global <br>

### Use Case: <br>
Nemotron Nano 12B V2 VL is a model for multi-modal document intelligence. It would be used by individuals or businesses that need to process documents such as invoices, receipts, and manuals. The model is capable of handling multiple images of documents, up to four images at a resolution of 1k x 2k each, along with a long text prompt. The expected use is for tasks like summarization and Visual Question Answering (VQA). The model is also expected to have a significant advantage in throughput. <br>

### Release Date:  <br>
HF [10/28/2025] via [URL](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16) <br>
Build.Nvidia.com [10/28/2025] via [URL](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl) <br>

### System Prompt

Nemotron Nano 12B V2 VL supports reasoning for text, image, and multi-image inputs.

Reasoning behavior is controlled via the system prompt. By default, reasoning is OFF.

For video inputs, reasoning is not supported.

- To enable reasoning (text and images only), include `/think` in the system prompt:
```
{"role": "system", "content": "/think"}
```
- To disable reasoning, include `/no_think` in the system prompt:
```
{"role": "system", "content": "/no_think"}
```
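
The rules above can be sketched as a small helper. This is a minimal illustration rather than an official client: `build_messages` and its parameters are hypothetical names, and the only behavior taken from this card is that `/think` enables reasoning, `/no_think` disables it, and video inputs do not support reasoning.

```python
def build_messages(query, has_video=False, reasoning=True):
    """Return a chat `messages` list with the reasoning flag in the system prompt."""
    # Reasoning is not supported for video inputs, so fall back to /no_think there.
    use_reasoning = reasoning and not has_video
    system_prompt = "/think" if use_reasoning else "/no_think"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]


messages = build_messages("Summarize this invoice.")
print(messages[0])
```

Passing `has_video=True` always yields `/no_think`, regardless of the `reasoning` argument.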

# Model Architecture:
**Architecture Type:**
Transformer  <br>
**Network Architecture:** <br>
Vision Encoder: CRadioV2-H <br>
Language Encoder: NVIDIA-Nemotron-Nano-12B-v2 <br>

Number of model parameters: 12.6B<br>

## Input: <br>
**Input Type(s):**  Image, Video, Text<br>
**Input Format:** Image (png,jpg,jpeg,webp), Video (MP4, MOV, WEBM), Text (String)<br>
**Input Parameters:** Image (2D),Video(3D), Text (1D)  <br>
**Other Properties Related to Input:**
- Input Images Supported: 5
- Language Supported: English only <br>
- Input + Output Token: 128K
- Minimum Resolution: 32 × 32 pixels
- Maximum Resolution: Determined by a 12-tile layout constraint, with each tile being 512 × 512 pixels. This supports aspect ratios such as:
  - 4 × 3 layout: up to 2048 × 1536 pixels
  - 3 × 4 layout: up to 1536 × 2048 pixels
  - 2 × 6 layout: up to 1024 × 3072 pixels
  - 6 × 2 layout: up to 3072 × 1024 pixels
  - Other configurations are allowed, provided total tiles ≤ 12
- Channel Count: 3 channels (RGB)
- Alpha Channel: Not supported (no transparency) <br>
- Frames: 2 FPS, with a minimum of 8 frames and a maximum of 128 frames
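
The resolution and frame limits above can be checked before sending a request. The sketch below is an illustration under stated assumptions, not the model's actual preprocessing: it assumes each image side is covered by whole 512-pixel tiles, and that the frame count is simply the 2 FPS sample count clamped to the 8 to 128 range. The function names are hypothetical, and the real pipeline may resize inputs instead of rejecting them.

```python
import math

TILE = 512       # tile edge in pixels
MAX_TILES = 12   # tile budget per image
MIN_SIDE = 32    # minimum width and height in pixels


def tiles_needed(width, height):
    """Tiles required if each side is covered by whole 512-pixel tiles (assumed)."""
    return math.ceil(width / TILE) * math.ceil(height / TILE)


def fits_tile_budget(width, height):
    """True if the image meets the minimum size and the 12-tile limit."""
    if width < MIN_SIDE or height < MIN_SIDE:
        return False
    return tiles_needed(width, height) <= MAX_TILES


def sampled_frames(duration_s, fps=2, min_frames=8, max_frames=128):
    """Frame count at 2 FPS, clamped to [8, 128] (the clamping is an assumption)."""
    return max(min_frames, min(max_frames, int(duration_s * fps)))


print(fits_tile_budget(2048, 1536), fits_tile_budget(2048, 2048))
print(sampled_frames(10))
```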

## Output: <br>
**Output Type(s):**  Text <br>
**Output Format:**  String <br>
**Output Parameters:** 1D <br>
**Other Properties Related to Output:**  Input + Output Token: 128K <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br> 

## Software Integration:
**Runtime Engine(s):** 
* vLLM <br>
* TRT-LLM <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA L40S <br>
* NVIDIA A100 <br>
* NVIDIA B200 <br>
* NVIDIA H100/H200 <br>
* NVIDIA RTX PRO 6000 Server Edition<br>
* NVIDIA GH100 <br>
* NVIDIA GH200 <br>
* NVIDIA GB200 <br>

**Preferred/Supported Operating System(s):**
* Linux <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s):
v1.0<br>

## Training, Testing, and Evaluation Datasets:

### Training Datasets: 

**Data Modalities** <br>
**Total Samples:** 39,486,703 <br>
**Total Number of Datasets:** 270 <br>   
**Text-only datasets:** 33 <br>
**Text-and-image datasets:** 176 <br>
**Video-and-text datasets:** 61 <br>
**Total size:** 27.7 TB <br>

**Data modalities:** Text, Image, Video <br>
**Data Collection Method by dataset:** Hybrid: Automated, Human, Synthetic <br>
**Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic <br>

**Dataset partition:** Training [100%], Testing [0%], Validation [0%] <br> 
**Time period for training data collection:** 2023-2025 <br> 
**Time period for testing data collection:** N/A <br> 
**Time period for validation data collection:** N/A <br>

The post-training datasets consist of a mix of internal and public datasets designed for training vision language models across various tasks. They include:

* Public datasets sourced from publicly available images and annotations, supporting tasks like classification, captioning, visual question answering, conversation modeling, document analysis and text/image reasoning.
* Internal text and image datasets built with public commercial images and internal labels, adapted for the same tasks as listed above.
* Synthetic image datasets generated programmatically for specific tasks like tabular data understanding and optical character recognition (OCR), for English, Chinese as well as other languages.
* Video datasets supporting video question answering and reasoning tasks from publicly available video sources, with either publicly available or internally generated annotations.
* Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
* NVIDIA-Sourced Synthetic Datasets for text reasoning.
* Private datasets for safety alignment or VQA on invoices.
* Crawled or scraped captioning, VQA, and video datasets.
* Some datasets were improved with Qwen2.5-72B-Instruct annotations

For roughly 30% of our total training corpus and several of the domains listed above, we used commercially permissive models to perform:
* Language translation
* Re-labeling of annotations for text, image and video datasets
* Synthetic data generation
* Generating chain-of-thought (CoT) traces

Additional processing for several datasets included rule-based QA generation (e.g., with templates), expanding short answers into longer responses, as well as proper reformatting. More details can be found [here](https://arxiv.org/abs/2501.14818). 

Image-based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>

# Public Datasets <br> 
| Type | Data Type | Total Samples | Total Size (GB) |
|------|-----------|---------------|------------------|
| Function call | text | 8,000 | 0.02 |
| Image Captioning | image, text | 1,422,102 | 1,051.04 |
| Image Reasoning | image, text | 1,888,217 | 286.95 |
| OCR | image, text | 9,830,570 | 5,317.60 |
| Referring Expression Grounding | image, text | 14,694 | 2.39 |
| Safety | image, text | 34,187 | 9.21 |
| Safety | text | 57,223 | 0.52 |
| Safety | video, text | 12,988 | 11.78 |
| Text Instruction Tuning | text | 245,056 | 1.13 |
| Text Reasoning | text | 225,408 | 4.55 |
| VQA | image, text | 8,174,136 | 2,207.52 |
| VQA | video, text | 40,000 | 46.05 |
| Video Captioning | video, text | 3,289 | 6.31 |
| Video Reasoning | video, text | 42,620 | 49.10 |
| VideoQA | video, text | 1,371,923 | 17,641.79 |
| Visual Instruction Tuning | image, text | 1,173,877 | 167.79 |
| **TOTAL** | | **24,544,290** | **26,803.75** |

<br>

# Private Datasets <br> 
| Type | Modalities | Total Samples | Total Size (GB) |
|------|------------|---------------|------------------|
| Image Reasoning | image, text | 17,729 | 15.41 |
| Text Reasoning | text | 445,958 | 9.01 |
| **TOTAL** | | **463,687** | **24.42** |
<br>

# Data Crawling and Scraping <br> 
| Type | Modalities | Total Samples | Total Size (GB) |
|------|------------|---------------|------------------|
| Image Captioning | image, text | 39,870 | 10.24 |
| VQA | image, text | 40,348 | 3.94 |
| VideoQA | video, text | 288,728 | 393.30 |
| **TOTAL** | | **368,946** | **407.48** |
<br>

# User-Sourced Data (Collected by Provider including Prompts) <br> 
<br>

# Self-Sourced Synthetic Data <br> 
| Type | Data Type | Total Samples | Total Size (GB) |
|------|-----------|---------------|------------------|
| Code | text | 1,165,591 | 54.15 |
| OCR | image, text | 216,332 | 83.53 |
| Text Reasoning | text | 12,727,857 | 295.80 |
| **TOTAL** | | **14,109,780** | **433.48** |
<br>
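
As a quick sanity check, the TOTAL rows of the four tables above can be summed and compared with the overall figures reported under Data Modalities. The snippet below only copies numbers from those rows; the dictionary keys are illustrative labels.

```python
# (samples, size in GB) copied from the TOTAL rows of the four tables above
totals = {
    "public": (24_544_290, 26_803.75),
    "private": (463_687, 24.42),
    "crawled_or_scraped": (368_946, 407.48),
    "self_sourced_synthetic": (14_109_780, 433.48),
}

total_samples = sum(samples for samples, _ in totals.values())
total_gb = round(sum(size for _, size in totals.values()), 2)

print(f"{total_samples:,} samples, {total_gb:,} GB")
```

The sums come to 39,486,703 samples and 27,669.13 GB, which matches the reported sample count and rounds to the reported 27.7 TB.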

**Properties**<br>
* Additionally, the dataset collection (for training and evaluation) consists of a mix of internal and public datasets designed for training and evaluation across various tasks. It includes: 
* Internal datasets built with public commercial images and internal labels, supporting tasks like conversation modeling and document analysis.
* Public datasets sourced from publicly available images and annotations, adapted for tasks such as image captioning and visual question answering.
* Synthetic datasets generated programmatically for specific tasks like tabular data understanding.
* Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).

### Evaluation Datasets:
The following external benchmarks are used for evaluating the model: <br>

| Dataset |
|---------|
| [RDTableBench](https://github.com/Filimoa/rd-tablebench?tab=readme-ov-file) |
| NVIDIA internal test set for OCR |
| [MMMU Val with ChatGPT as judge](https://mmmu-benchmark.github.io/)  |
| [AI2D Test](https://prior.allenai.org/projects/diagram-understanding) |
| [ChartQA Test](https://github.com/vis-nlp/ChartQA) |
| [InfoVQA Val](https://www.docvqa.org/datasets/infographicvqa) |
| [OCRBench](https://github.com/Yuliang-Liu/MultimodalOCR) |
| [OCRBenchV2](https://github.com/Yuliang-Liu/MultimodalOCR) English |
| [DocVQA Val](https://www.docvqa.org/datasets) |
| [SlideVQA Val](https://github.com/nttmdlab-nlp/SlideVQA) |
| [Video MME](https://github.com/MME-Benchmarks/Video-MME)  |

Data Collection Method by dataset:  <br>
* Hybrid: Human, Automated <br>

Labeling Method by dataset:  <br>
* Hybrid: Human, Automated  <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** N/A <br>

**Dataset License(s):** N/A <br>

Evaluation benchmark scores: <br>

| Benchmarks         | Score |
|--------------------|-------|
| MMMU*              | 68    |
| MathVista*         | 76.9  |
| AI2D               | 87.11 |
| OCRBenchv2         | 62.0  |
| OCRBench           | 85.6  |
| OCR-Reasoning      | 36.4  |
| ChartQA            | 89.72 |
| DocVQA             | 94.39 |
| Video-MME w/o sub  | 65.9  |
| Vision Average     | 74.0  |

<br>
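
The Vision Average row appears consistent with the unweighted mean of the nine benchmark scores listed above it. The source does not state how the average is computed, so the snippet below is only a sanity check under that assumption.

```python
# Scores copied from the benchmark table above (Vision Average row excluded)
scores = {
    "MMMU": 68,
    "MathVista": 76.9,
    "AI2D": 87.11,
    "OCRBenchv2": 62.0,
    "OCRBench": 85.6,
    "OCR-Reasoning": 36.4,
    "ChartQA": 89.72,
    "DocVQA": 94.39,
    "Video-MME w/o sub": 65.9,
}

# Unweighted mean across benchmarks (assumed definition of "Vision Average")
vision_average = sum(scores.values()) / len(scores)
print(round(vision_average, 1))
```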

# Inference:
**Acceleration Engine(s):** vLLM, TRT-LLM <br>

**Test Hardware:** <br>  
* NVIDIA L40S <br>
* NVIDIA A100 <br>
* NVIDIA B200 <br>
* NVIDIA H100/H200 <br>
* NVIDIA RTX PRO 6000 Server Edition<br>
* NVIDIA GH200 <br>
* NVIDIA GB200 <br>

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.<br> Please report security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
<br>Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.<br>
Outputs generated by these models may contain political content or other potentially misleading information, issues with content security and safety, or unwanted bias that is independent of our oversight.

## Bias

| Field | Response |
|:---|:---|
| Participation considerations from adversely impacted groups [protected classes](https://calcivilrights.ca.gov/disputeresolution/protected-characteristics/) in model design and testing: | None |
| Bias Metric (If Measured): | [BBQ Accuracy Scores in Ambiguous Contexts](https://github.com/nyu-mll/BBQ/) |
| Which characteristic (feature) show(s) the greatest difference in performance?: | The model shows high variance across many characteristics when used at a high temperature, with the greatest measurable difference seen in categories such as Gender Identity and Race x Gender. |
| Which feature(s) have the worst performance overall? | Age (ambiguous) has both the lowest category accuracy listed (0.75) and a notably negative bias score (–0.56), indicating it is the worst-performing feature overall in this evaluation. |
| Measures taken to mitigate against unwanted bias: | None |
| If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | The training datasets contain a large amount of synthetic data generated by LLMs. We manually curated prompts. |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | Bias Benchmark for Question Answering (BBQ) |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | The datasets, which include video datasets (e.g., YouCook2, VCG Human Dataset) and image captioning datasets, do not collectively or exhaustively represent all demographic groups (and proportionally therein). <br> For instance, these datasets do not contain explicit mentions of demographic classes such as age, gender, or ethnicity in over 80% of samples. In the subset where analysis was performed, certain datasets contain skews in the representation of participants; for example, perceived gender of "female" participants may be significant compared to "male" participants for certain datasets. Separately, individuals aged "40 to 49 years" and "20 to 29 years" are the most frequent among ethnic identifiers. Toxicity analysis was additionally performed on several datasets to identify potential not-safe-for-work samples and risks. <br> To mitigate these imbalances, we recommend considering evaluation techniques such as bias audits, fine-tuning with demographically balanced datasets, and mitigation strategies like counterfactual data augmentation to align with the desired model behavior. This evaluation was conducted on a data subset ranging from 200 to 3,000 samples per dataset; as such, certain limitations may exist in the reliability of the embeddings. A baseline of 200 samples was used across all datasets, with larger subsets of up to 3,000 samples utilized for certain in-depth analyses. |

## Explainability

Field                                                                                                  |  Response
:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
Intended Task/Domain:                                                                   |  Visual Question Answering
Model Type:                                                                                            |  Transformer
Intended Users:                                                                                        | Individuals and businesses that need to process documents such as invoices, receipts, and manuals. Also, users who are building multi-modal agents and RAG systems.
Output:                                                                                                |  Text 
Tools used to evaluate datasets to identify synthetic data and ensure data authenticity. | We used a Gemma-3 4B-based filtering model fine-tuned on [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) to ensure the quality of synthetic data.
Describe how the model works:                                                                          | The model combines a Vision Encoder with a Nemotron 5.5H-12B Language Encoder. It processes multiple input modalities, including text, multiple images, and video. It fuses these inputs and uses its large language model backbone with a 128K context length to perform visual Q&A, summarization, and data extraction.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:  |  Not Applicable
Technical Limitations & Mitigation:                                                                    |  The model has a limited maximum resolution determined by a 12-tile layout constraint, where each tile is 512x512 pixels. It also supports a limited number of input images (up to 4) and has a maximum context length of 128K tokens for combined input and output.
Verified to have met prescribed NVIDIA quality standards:  |  Yes
Performance Metrics:                                                                                   |  Accuracy (Visual Question Answering), Latency, Throughput
Potential Known Risks:                                                                                 | The model may produce output that is biased, toxic, or incorrect. It may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. While we have taken safety and security into account and are continuously improving, outputs may still contain political content, misleading information, or unwanted bias beyond our control.
Licensing:                                                                                             |  Governing Terms: Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)

## Privacy

Field                                                                                                                              |  Response
:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
Generatable or reverse engineerable personal data?                                                     |  No
Personal data used to create this model?                                                                                       |  No
Was consent obtained for any personal data used?                                                                                             |  Not Applicable
A description of any methods implemented in data acquisition or processing, if any, to address the prevalence of personal data in the training data, where relevant and applicable. | We used only prompts that do not contain any personal data for synthetic data generation. 
How often is dataset reviewed?                                                                                                     |  Before release and during dataset creation and model training
Is there provenance for all datasets used in training?                                                                                |  Yes
Does data labeling (annotation, metadata) comply with privacy laws?                                                                |  Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made?                           |  No, not possible with externally-sourced data.
Applicable Privacy Policy        | [Privacy Policy](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/)

During AI model development, strict adherence to copyright policy ensured compliance through risk mitigation and legal reviews. Post-data collection, reserved rights content is identified and removed, with verified opt-out processes for rightsholders. Detailed records document due diligence and transparency.

We employ automated tools and data processing techniques to scan for Personally Identifiable Information (PII) during pre-training to identify and filter certain categories of personal information, including public-facing contact details such as email addresses and phone numbers. Scans of Common Crawl, CC-News, and Wikimedia datasets did not detect PII in the majority of samples. However, Microsoft Presidio indicated potential findings including business contact information embedded in natural language, such as email addresses and phone numbers. Verified instances of PII were removed through a combination of automated filtering and human-in-the-loop validation.

## Safety & Security

Field                                               |  Response
:---------------------------------------------------|:----------------------------------
Model Application Field(s):                               |  Customer Service, Media & Entertainment, Enterprise Document Intelligence and Processing & Retail 
Describe the life critical impact (if present).   |  Not Applicable 
Description of methods implemented in data acquisition or processing, if any, to address other types of potentially harmful data in the training, testing, and validation data: | We used a guard model for content safety to exclude potentially harmful data from  training. 
Description of any methods implemented in data acquisition or processing, if any, to address illegal or harmful content in the training data, including, but not limited to, child sexual abuse material (CSAM) and non-consensual intimate imagery (NCII) | We used a Gemma-3 4B-based guard model trained on [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) for content safety to exclude potentially illegal or harmful content from the training. We also did CSAM checks on our image datasets for training.
Use Case Restrictions:                              |  Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
Model and dataset restrictions:            |  The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.

This AI model was developed based on our policies to ensure responsible data handling and risk mitigation. The datasets used for training have been scanned for harmful content and illegal content, consistent with our policies including scanning for Child Sexual Abuse Material (CSAM). Ongoing review and monitoring mechanisms are in place based on our policies and to maintain data integrity.

The model was optimized explicitly for instruction following and as such is more susceptible to prompt injection and jailbreaking in various forms as a result of its instruction tuning. This means that the model should be paired with additional rails or system filtering to limit exposure to instructions from malicious sources, either directly or indirectly by retrieval (e.g. via visiting a website), as they may yield outputs that can lead to harmful, system-level outcomes up to and including remote code execution in agentic systems when effective security controls including guardrails are not in place. The model may generate answers that may be inaccurate, omit key information, include irrelevant or redundant text, or produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.

## Prototype

```python
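import os

import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

# Read the API key from the NVIDIA_API_KEY environment variable
headers = {
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

payload = {
    "model": "nvidia/nemotron-nano-12b-v2-vl",
    "messages": [
        {
            "role": "user",
            "content": "",  # put your prompt text here
        }
    ],
}

# re-use connections
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload)

# Raise an exception on HTTP errors (4xx/5xx)
response.raise_for_status()
response_body = response.json()
print(response_body)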
```
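
For image or video inputs, the user message content becomes a list of parts rather than a plain string. The sketch below mirrors the content-building logic of the JavaScript and Bash examples in this section (base64 data URLs under `image_url` or `video_url`); the `build_content` helper and the `SUPPORTED` table name are illustrative, not part of any NVIDIA SDK.

```python
import base64
from pathlib import Path

# extension -> (MIME type, content part key), copied from the other examples in this section
SUPPORTED = {
    "png": ("image/png", "image_url"),
    "jpg": ("image/jpeg", "image_url"),
    "jpeg": ("image/jpeg", "image_url"),
    "webp": ("image/webp", "image_url"),
    "mp4": ("video/mp4", "video_url"),
    "webm": ("video/webm", "video_url"),
    "mov": ("video/mov", "video_url"),
}


def build_content(query, media_files):
    """Return a plain string for text-only input, else a list of content parts."""
    if not media_files:
        return query
    content = [{"type": "text", "text": query}]
    has_video = False
    for media_file in media_files:
        ext = Path(media_file).suffix.lower().lstrip(".")
        if ext not in SUPPORTED:
            raise ValueError(f"{media_file} format is not supported")
        mime, key = SUPPORTED[ext]
        has_video = has_video or key == "video_url"
        # Embed the file as a base64 data URL
        data = base64.b64encode(Path(media_file).read_bytes()).decode("ascii")
        content.append({"type": key, key: {"url": f"data:{mime};base64,{data}"}})
    if has_video and len(media_files) != 1:
        raise ValueError("Only a single video is supported.")
    return content
```

The returned value can be used as the `content` of the user message in the payload above, for example `build_content("Describe the scene", ["sample1.png", "sample2.png"])`.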

```javascript
import fs from 'fs';
import OpenAI from 'openai';
import path from 'path';

const stream = false; // set to true to stream the response
const query = "Describe the scene";

const kApiKey = process.env.NVIDIA_API_KEY; // read the key from the environment

const openai = new OpenAI({
apiKey: kApiKey,
baseURL: 'https://integrate.api.nvidia.com/v1',
});

const kSupportedList = {
"png": ["image/png", "image_url"],
"jpg": ["image/jpeg", "image_url"],
"jpeg": ["image/jpeg", "image_url"],
"webp": ["image/webp", "image_url"],
"mp4": ["video/mp4", "video_url"],
"webm": ["video/webm", "video_url"],
"mov": ["video/mov", "video_url"]
};

// Get file extension
function getExtension(filename) {
const ext = path.extname(filename).toLowerCase();
return ext.slice(1); // remove the leading dot
}

// Get MIME type
function mimeType(ext) {
return kSupportedList[ext][0];
}

// Get media type
function mediaType(ext) {
return kSupportedList[ext][1];
}

// Encode media file to base64
function encodeMediaBase64(mediaFile) {
const fileBuffer = fs.readFileSync(mediaFile);
return fileBuffer.toString('base64');
}

// Chat with media
async function chatWithMedia(mediaFiles, query, stream = false) {
let hasVideo = false;
let content;

// Build content based on whether we have media files
if (mediaFiles.length === 0) {
// Text-only mode
content = query;
} else {
// Build content array with text and media
content = [{ type: "text", text: query }];

for (const mediaFile of mediaFiles) {
const ext = getExtension(mediaFile);
if (!(ext in kSupportedList)) {
throw new Error(`${mediaFile} format is not supported`);
}

const mediaTypeKey = mediaType(ext);
if (mediaTypeKey === "video_url") {
hasVideo = true;
}

console.log(`Encoding ${mediaFile} as base64...`);
const base64Data = encodeMediaBase64(mediaFile);

// Add media to content array
const mediaObj = {
type: mediaTypeKey,
[mediaTypeKey]: {
url: `data:${mimeType(ext)};base64,${base64Data}`
}
};
content.push(mediaObj);
}

if (hasVideo && mediaFiles.length !== 1) {
throw new Error("Only a single video is supported.");
}
}

// Videos only support /no_think, images support both

const systemPrompt = hasVideo ? "/no_think" : "/think";

const messages = [
{
"role": "system",
"content": systemPrompt
},
{
"role": "user",
"content": content
}
];

const payload = {
max_tokens: 1024,
temperature: 0.2,
top_p: 0.7,
frequency_penalty: 0,
presence_penalty: 0,
messages: messages,
stream: stream,
model: "nvidia/nemotron-nano-12b-v2-vl"
};

// Use OpenAI client
if (stream) {
const completion = await openai.chat.completions.create(payload);

for await (const chunk of completion) {
process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
} else {
const completion = await openai.chat.completions.create(payload);
console.log(JSON.stringify(completion));
}
}

// Main function to run the script
async function main() {
/** Usage:
*  node test.js                                    # Text-only
*  node test.js sample.mp4                         # Single video
*  node test.js sample1.png sample2.png            # Multiple images
*/
const args = process.argv.slice(2);
const mediaSamples = args;
await chatWithMedia(mediaSamples, query, stream);
}

main();
```

```bash
#!/bin/bash
stream=false  # set to true to stream the response

invoke_url="https://integrate.api.nvidia.com/v1/chat/completions"
query="Describe the scene"

# supported table: ext:mime_type,media_type
kSupportedList=("png:image/png,image_url" "jpg:image/jpeg,image_url" "jpeg:image/jpeg,image_url" "webp:image/webp,image_url" "mp4:video/mp4,video_url" "webm:video/webm,video_url" "mov:video/mov,video_url")

# get "mime,media_type"
get_media_info() {
local ext="$1"
for item in "${kSupportedList[@]}"; do
if [[ "$item" == "$ext:"* ]]; then
# Return "mime,media_type"
echo "${item#*:}"
return
fi
done
echo ""
}

get_extension() {
filename="$1"
echo "${filename##*.}" | tr '[:upper:]' '[:lower:]'
}

# Encode file to base64
encode_base64() {
base64 -i "$1" | tr -d '\n'
}

chat_with_media() {
infer_url="$1"
query="$2"
shift 2
media_files=("$@")

has_video=false

# Build content based on whether we have media files
if [ "${#media_files[@]}" -eq 0 ]; then
# Text-only mode
content_type="string"
content_value="$query"
else
# Build content array starting with text
content_type="array"
content_items=$(jq -n --arg query "$query" '[{"type": "text", "text": $query}]')

for media_file in "${media_files[@]}"; do
ext=$(get_extension "$media_file")

media_info=$(get_media_info "$ext")
if [ -z "$media_info" ]; then
echo "$media_file format is not supported"
exit 1
fi
mime_type="${media_info%%,*}"
media_type_key="${media_info#*,}"

if [[ "$media_type_key" == "video_url" ]]; then
has_video=true
fi

echo "Encoding $media_file as base64..."
base64_data=$(encode_base64 "$media_file")

# Build media object and add to content array
# Use temp files to avoid error with large files
temp_data=$(mktemp)
temp_obj=$(mktemp)
temp_content=$(mktemp)

echo "$base64_data" > "$temp_data"

# Create media object in temp file
jq -n \
--arg type "$media_type_key" \
--arg mime "$mime_type" \
--rawfile data "$temp_data" \
'{type: $type, ($type): {url: ("data:\($mime);base64," + ($data | gsub("\n"; "")))}}' > "$temp_obj"

# Add media object to content array
echo "$content_items" | jq --slurpfile obj "$temp_obj" '. += $obj' > "$temp_content"
content_items=$(cat "$temp_content")

rm -f "$temp_data" "$temp_obj" "$temp_content"
done

if $has_video && [ "${#media_files[@]}" -gt 1 ]; then
echo "Only single video supported."
exit 1
fi
fi

# Videos only support /no_think, images support both

if $has_video; then
system_prompt="/no_think"
else
system_prompt="/think"
fi

headers=(
-H "Authorization: Bearer $NVIDIA_API_KEY"
-H "Content-Type: application/json"
)
if [ "$stream" = true ]; then
headers+=(-H "Accept: text/event-stream")
else
headers+=(-H "Accept: application/json")
fi

# Build payload based on content type
temp_payload_file=$(mktemp)

if [ "$content_type" = "string" ]; then
# Text-only payload
jq -n \
--arg query "$content_value" \
--argjson stream "$stream" \
--arg system_prompt "$system_prompt" \
'{
max_tokens: 1024,
temperature: 0.2,
top_p: 0.7,
frequency_penalty: 0,
presence_penalty: 0,
messages: [
{"role": "system", "content": $system_prompt},
{"role": "user", "content": $query}
],
stream: $stream,
model: "nvidia/nemotron-nano-12b-v2-vl"
}' > "$temp_payload_file"
else
# Media payload
temp_content_file=$(mktemp)
echo "$content_items" > "$temp_content_file"

jq -n \
--slurpfile content "$temp_content_file" \
--argjson stream "$stream" \
--arg system_prompt "$system_prompt" \
'{
max_tokens: 1024,
temperature: 0.2,
top_p: 0.7,
frequency_penalty: 0,
presence_penalty: 0,
messages: [
{"role": "system", "content": $system_prompt},
{"role": "user", "content": $content[0]}
],
stream: $stream,
model: "nvidia/nemotron-nano-12b-v2-vl"
}' > "$temp_payload_file"

rm -f "$temp_content_file"
fi

response=$(curl -s -X POST "$infer_url" "${headers[@]}" -d @"$temp_payload_file")

rm -f "$temp_payload_file"

if [ "$stream" = true ]; then
echo "$response" | while IFS= read -r line; do
echo "$line"
done
else
echo "$response" | jq .
fi
}

# Usage examples:
# $0                                    # Text-only
# $0 sample.mp4                         # Single video
# $0 sample1.png sample2.png            # Multiple images

chat_with_media "$invoke_url" "$query" "$@"
```
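
When `stream` is enabled, the response arrives as server-sent events rather than a single JSON body. The Python sketch below shows one way to extract the text deltas. It assumes OpenAI-style chunks (`data: {...}` lines ending with `data: [DONE]`), which is inferred from the JavaScript example reading `chunk.choices[0]?.delta?.content` and is not confirmed by this card; `iter_stream_text` is a hypothetical helper name.

```python
import json


def iter_stream_text(lines):
    """Yield text deltas from the event lines of a streamed chat completion."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker (assumed OpenAI-style)
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text


sample = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With the `requests` library, the same helper could consume `response.iter_lines()` from a call made with `stream=True` and an `Accept: text/event-stream` header.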