Multi-modal Inference

1 HR

Set up multi-modal inference with TensorRT


Step 1
Configure Docker permissions

To manage containers without sudo, your user must be in the docker group. If you skip this step, you will need to run every Docker command in this playbook with sudo.

Open a new terminal and test Docker access. In the terminal, run:

docker ps

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run commands with sudo:

sudo usermod -aG docker $USER
newgrp docker
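
A quick check confirms the change took effect. Note that newgrp docker applies the new group only to the current shell; other open terminals need a fresh login to pick it up.

# Confirm the current shell now sees the docker group
id -nG | grep -w docker

# Re-test Docker access; this should now list containers without a permission error
docker ps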

Step 2
Launch the TensorRT container environment

Start the NVIDIA PyTorch container with GPU access and the HuggingFace cache mounted. This provides a TensorRT development environment with all required dependencies pre-installed.

docker run --gpus all --ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 -it --rm \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.11-py3
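
Before building anything, it is worth verifying that the GPU is visible from inside the container; both checks below use tools that ship with the image.

# Inside the container: the GPU should be listed here
nvidia-smi

# PyTorch in the container should report CUDA as available
python3 -c "import torch; print(torch.cuda.is_available())"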

Step 3
Clone and set up TensorRT repository

Download the TensorRT repository and configure the environment for diffusion model demos.

# Clone only the main branch to keep the download small
git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch && cd TensorRT

# The container starts in /workspace, so the clone lands at /workspace/TensorRT
export TRT_OSSPATH=/workspace/TensorRT/
cd $TRT_OSSPATH/demo/Diffusion
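
Before installing dependencies, a quick listing confirms that the demo scripts used in Steps 5 through 7 are present.

# Both scripts are invoked later in this playbook
ls demo_txt2img_flux.py demo_txt2img_xl.py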

Step 4
Install required dependencies

Install NVIDIA ModelOpt and other dependencies for model quantization and optimization.

# Install OpenGL and X11 libraries needed by the demo's imaging dependencies
apt update
apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6 libxrandr2 libxss1 libxcomposite1 libxdamage1 libxfixes3 libxcb1

# Install ModelOpt first, then drop its pin from requirements.txt so the version installed above is kept
pip install nvidia-modelopt[torch,onnx]
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip3 install -r requirements.txt
pip install onnxconverter_common

Set your HuggingFace token so the demo scripts can download the required models.

# No spaces around '=' in a shell variable assignment
export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
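
To catch an invalid token before the first long download, you can ask the Hub who you are. This assumes the huggingface_hub package is available, which the requirements install above should provide.

# Prints your HuggingFace account name if the token is valid
python3 -c "from huggingface_hub import whoami; print(whoami(token='$HF_TOKEN')['name'])"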

Step 5
Run Flux.1 Dev model inference

Test multi-modal inference using the Flux.1 Dev model with different precision formats.

Substep A. BF16 precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --download-onnx-models --bf16

Substep B. FP8 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --quantization-level 4 --fp8 --download-onnx-models

Substep C. FP4 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --fp4 --download-onnx-models
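
To compare the precision modes, time each run. Keep in mind that the first invocation of each mode includes one-time ONNX downloads and TensorRT engine builds, so only a repeat run reflects steady-state inference speed.

# Example: time a repeat FP4 run after its engine has been built once
time python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --fp4 --download-onnx-models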

Step 6
Run Flux.1 Schnell model inference

Test the faster Flux.1 Schnell variant with different precision formats.

WARNING

FP16 Flux.1 Schnell requires >48GB VRAM for native export

Substep A. FP16 precision (high VRAM requirement)

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version="flux.1-schnell"

Substep B. FP8 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version="flux.1-schnell" \
  --quantization-level 4 --fp8 --download-onnx-models

Substep C. FP4 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version="flux.1-schnell" \
  --fp4 --download-onnx-models

Step 7
Run SDXL model inference

Test the SDXL model with different precision formats for comparison.

Substep A. BF16 precision

python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models

Substep B. FP8 quantized precision

python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models --fp8

Step 8
Validate inference outputs

Check that the models generated images successfully and measure performance differences.

# Search the demo tree for generated images (the exact output location can vary by script)
find . -name "*.png" -o -name "*.jpg" | head -20

# Verify CUDA is accessible
nvidia-smi

# Check TensorRT version
python3 -c "import tensorrt as trt; print(f'TensorRT version: {trt.__version__}')"

Step 9
Cleanup and rollback

Exit the container and optionally remove the downloaded model cache to free disk space.

WARNING

This will delete all cached models and generated images

# Exit the container; it was started with --rm, so everything inside it
# (including generated images) is discarded
exit

# On the host: remove the HuggingFace model cache (optional)
rm -rf $HOME/.cache/huggingface/
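
To reclaim the remaining space, you can also remove the container image pulled in Step 2. Only do this if you are finished with the playbook, since re-pulling the image is a large download.

# On the host: remove the PyTorch container image
docker rmi nvcr.io/nvidia/pytorch:25.11-py3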

Step 10
Next steps

Use the validated setup to generate custom images or integrate multi-modal inference into your applications. Try different prompts or explore model fine-tuning with the established TensorRT environment.
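
As a first experiment, vary the prompt while reusing the flags you validated above. The demo scripts are argparse-based, so their built-in help is the reliable way to discover additional options; any flags beyond those used in this playbook should be verified there first.

# List the options the Flux demo supports before trying new flags
python3 demo_txt2img_flux.py --help

# Example: a custom prompt reusing the validated FP4 path
python3 demo_txt2img_flux.py "a watercolor painting of a lighthouse at dawn" \
  --hf-token=$HF_TOKEN --fp4 --download-onnx-models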

Resources

  • DGX Spark Documentation
  • DGX Spark Forum
  • DGX Spark User Performance Guide