Set up multi-modal inference with TensorRT
Multi-modal inference combines different data types, such as text, images, and audio, within a single model pipeline to generate or interpret richer outputs.
Instead of processing one input type at a time, multi-modal systems build shared representations that support tasks such as text-to-image generation, image captioning, and vision-language reasoning.
On GPUs, this enables parallel processing across modalities, delivering faster, higher-fidelity results for tasks that combine language and vision.
You'll deploy GPU-accelerated multi-modal inference capabilities on NVIDIA Spark using TensorRT to run Flux.1 and SDXL diffusion models with optimized performance across multiple precision formats (FP16, FP8, FP4).
# Confirm the host can see the GPU and driver
nvidia-smi

# Confirm Docker can access the GPU through the NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi
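Because FP8 and FP4 support depends on the GPU architecture and driver, it can also help to confirm the device's compute capability and available memory before building engines. The query below is a minimal sketch; the compute_cap field requires a reasonably recent driver, so adjust the fields if your driver does not report them.

# Report GPU name, compute capability, driver version, and total memory
nvidia-smi --query-gpu=name,compute_cap,driver_version,memory.total --format=csv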
All necessary files can be found in the NVIDIA TensorRT repository on GitHub: https://github.com/NVIDIA/TensorRT.
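As a sketch of the end-to-end flow, the commands below clone the TensorRT repository and invoke its demo/Diffusion pipelines for SDXL and Flux.1. Script names, flags (for example --hf-token and the precision switches), and supported models vary by TensorRT release, so treat these as illustrative and check the demo/Diffusion README in your checkout.

# Clone the TensorRT repository and install the diffusion demo's dependencies
git clone https://github.com/NVIDIA/TensorRT.git
cd TensorRT/demo/Diffusion
pip install -r requirements.txt

# SDXL text-to-image at the default FP16 precision
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN

# Flux.1 text-to-image with FP8 quantization (flag names may differ by release)
python3 demo_txt2img_flux.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --fp8

On the first run, the demo downloads model weights and builds TensorRT engines, which accounts for most of the quoted duration; later runs typically reuse the cached engines.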
Duration: 45-90 minutes depending on model downloads and optimization steps
Risks: Model downloads and TensorRT engine builds consume significant disk space and time, and FP8/FP4 quantization can change output quality relative to FP16.
Rollback: Stop and remove any containers started during setup, and delete downloaded model weights and built TensorRT engines to reclaim disk space.